On Sun, 22 Dec 2002, Markus Kuhn wrote:
> > It's actually quite convenient to be able to make applications 
> > NUL-transparent without having to recode all the string operations.
> 
> Is there a proper full specification of this encoding somewhere
> online? Merely replacing 0x00 with its overlong UTF-8 equivalent
> 0xc0 0x80 can't be the full story...

Depends on what you are doing.  Yes, there's a remaining issue with byte
sequences that don't look like valid UTF-8.  There is similarly an issue,
though, with things like normalization in more aggressive applications,
vs. binary files which should not be normalized.  And whether the input
sequence 0xc3 0x80 is U+00C0 or U+00C3 U+0080 also depends on intent --
someone who wants to edit a binary file will want it presented to him as
the latter, not the former.  In the general case, the application really
has to *know* whether the input is binary or text (and in the text case,
what kind of text).  (How it determines this is a different question.)
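
To make the 0xc3 0x80 example concrete, here is a small Python sketch.
It assumes Python's "latin-1" codec as a stand-in for the byte-per-character
reading of binary input; the helper name codepoints is made up for
illustration only.

```python
# The same two input bytes, read with two different intents.
raw = bytes([0xC3, 0x80])

# Read as UTF-8 text: a single character.
as_text = raw.decode("utf-8")

# Read as binary, one character per byte (latin-1 maps byte N to U+00NN).
as_binary = raw.decode("latin-1")


def codepoints(s):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return ["U+%04X" % ord(ch) for ch in s]


print(codepoints(as_text))    # ['U+00C0']
print(codepoints(as_binary))  # ['U+00C3', 'U+0080']
```

Nothing in the bytes themselves says which reading is right; the caller
has to pick the decoder.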

Input processing of binary data can treat it as text encoded in "UCS-1" --
that is, as characters U+0000 through U+00FF, one byte apiece -- and then
the *only* problem is how you represent U+0000 internally without using the
magic 0x00 octet.  For that, the simple solution is 0xc0 0x80, and so yes,
that is the full story.
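
A minimal Python sketch of that scheme, under the assumption that the
internal form is ordinary UTF-8 for U+0001 through U+00FF plus the overlong
pair 0xc0 0x80 for U+0000.  The function names binary_to_internal and
internal_to_binary are invented for this example and are not taken from any
existing library.

```python
def binary_to_internal(data: bytes) -> bytes:
    """Treat each input byte as a character U+0000..U+00FF and encode it
    as UTF-8, except that U+0000 becomes the overlong pair 0xC0 0x80."""
    out = bytearray()
    for b in data:
        if b == 0x00:
            out += b"\xc0\x80"             # NUL: never emit a 0x00 octet
        elif b < 0x80:
            out.append(b)                  # ASCII passes through unchanged
        else:
            out.append(0xC0 | (b >> 6))    # lead byte, 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))  # continuation byte
    return bytes(out)


def internal_to_binary(internal: bytes) -> bytes:
    """Inverse of binary_to_internal, for input that function produced."""
    # 0xC0 only ever appears as the lead byte of the encoded NUL, so a
    # plain byte-level replace is safe here.
    text = internal.replace(b"\xc0\x80", b"\x00").decode("utf-8")
    return text.encode("latin-1")


data = bytes(range(256))
internal = binary_to_internal(data)
assert b"\x00" not in internal               # NUL-terminated code can hold it
assert internal_to_binary(internal) == data  # and the mapping is reversible
```

The internal form contains no 0x00 octet, so existing NUL-terminated
string code can carry it untouched, which is the convenience being
discussed.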

How errors in input that is allegedly UTF-8 text are handled is a
different issue.  It doesn't necessarily involve an attempt to encode the
input byte sequence precisely and reversibly. 
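
As one illustration of error handling that is not reversible, the sketch
below uses the "replace" error handler of Python's UTF-8 decoder.  This is
just one possible policy, chosen for the example; it is not something the
discussion above prescribes.

```python
# Two different malformed inputs that claim to be UTF-8 text.
bad_a = b"abc\xff"
bad_b = b"abc\xfe"

# A strict decoder simply rejects the input.
try:
    bad_a.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for each offending byte.
text_a = bad_a.decode("utf-8", errors="replace")
text_b = bad_b.decode("utf-8", errors="replace")

print(strict_ok)         # False
print(text_a == text_b)  # True: both inputs collapse to "abc\ufffd"
```

Once both inputs have collapsed to the same text, the original bytes
cannot be recovered, and for input that is supposed to be text that is
often acceptable.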

                                                          Henry Spencer
                                                       [EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
