Followup to:  <[EMAIL PROTECTED]>
By author:    Markus Kuhn <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> Is there a proper full specification of this encoding somewhere
> online? Merely replacing 0x00 with its overlong UTF-8 equivalent
> 0xc0 0x80 can't be the full story, because what you are interested
> in the end must surely be binary transparency, not merely
> NUL-transparency. I don't see what NUL-transparency alone would
> be good for, as NUL is usually only a problem in arbitrary binary
> strings.
> 

I think it is meant to be NUL-transparent only.  Java apparently wants
to be able to treat \0 (U+0000) like any other Unicode character.
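
A minimal sketch of that NUL-only substitution, for illustration.  The
function names are hypothetical, and this covers just the 0xc0 0x80
trick discussed above, not whatever else the Java format may specify:

```python
def encode_nul_transparent(text: str) -> bytes:
    """Encode as UTF-8, but emit U+0000 as the overlong pair C0 80."""
    out = bytearray()
    for ch in text:
        if ch == "\0":
            out += b"\xc0\x80"          # overlong form, never a raw 0x00
        else:
            out += ch.encode("utf-8")   # everything else is plain UTF-8
    return bytes(out)


def decode_nul_transparent(data: bytes) -> str:
    """Inverse of the above: map C0 80 back to U+0000, then decode."""
    # 0xC0 never occurs in well-formed UTF-8, so any C0 80 pair in the
    # input can only have come from the substitution above.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")
```

The encoded form never contains a zero byte, so it can be passed
through NUL-terminated C string interfaces.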

It would in fact be trivial to extend UTF-8 to support an arbitrary
sequence of 32-bit binary units: just use the original definition of
UTF-8, ignoring any of the UTF-16-imposed crap, and use 0xFE and 0xFF
as initiators of 6-byte sequences for the range 0x80000000 to
0xFFFFFFFF.  Supporting any byte string is trickier, since it is not
clear what an arbitrary byte string should mean on the wide-character
side.
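
The 32-bit extension could be sketched roughly as follows.  This is
one possible reading, not a specification: it assumes the original
1- to 6-byte UTF-8 layout up to 0x7FFFFFFF, and assumes that a 0xFE or
0xFF lead byte carries bit 30 in its low bit, with bit 31 implied.
The name encode_unit is hypothetical.

```python
def encode_unit(v: int) -> bytes:
    """Encode one 32-bit unit in the extended UTF-8 sketched above."""
    if not 0 <= v <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    if v < 0x80:
        return bytes([v])               # plain ASCII, one byte

    # (exclusive upper bound, lead-byte marker, continuation bytes)
    forms = (
        (0x800,       0xC0, 1),
        (0x10000,     0xE0, 2),
        (0x200000,    0xF0, 3),
        (0x4000000,   0xF8, 4),
        (0x80000000,  0xFC, 5),         # original 6-byte form, FC/FD
        (0x100000000, 0xFE, 5),         # assumed extension, FE/FF
    )
    for limit, lead, ncont in forms:
        if v < limit:
            break

    if lead == 0xFE:
        v &= 0x7FFFFFFF                 # bit 31 is implied by FE/FF

    first = lead | (v >> (6 * ncont))
    tail = [0x80 | ((v >> (6 * i)) & 0x3F) for i in range(ncont - 1, -1, -1)]
    return bytes([first] + tail)
```

Under these assumptions, 0x7FFFFFFF encodes as FD BF BF BF BF BF and
0x80000000 as FE 80 80 80 80 80, so the two 6-byte ranges stay
distinguishable by their lead byte.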

> So you also have to specify how to represent any byte sequence
> including overlong UTF-8 sequences such as 0xc0 0x80. Until someone
> shows me the full spec behind this frequently quoted but unnamed
> Java derivative of UTF-8, I am not yet convinced that it is useful
> for anything in practice.

        -hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt    <[EMAIL PROTECTED]>
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
