Re: utf-8 encoding scheme

Henry Spencer Wed, 12 Jul 2000 12:34:17 -0700
On 12 Jul 2000, H. Peter Anvin wrote:
> >   1-byte characters 0xxxxxxx 
> >   2-byte characters 110xxxxx 10xxxxxx
> 
> ...It is also very important to realize that ONLY THE SHORTEST POSSIBLE
> SEQUENCE IS LEGAL...

It's worth noting that some implementations which generally hew pretty
close to this make one exception:  they represent ASCII NUL (U+0000) as
11000000 10000000, so that 00000000 can be used as a terminator within
programs without worrying that it will collide with a user character. 
This *is* a violation of the rules, and one would hope that it will stay
a program-internal convention, but it may well sneak out into files.
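
A minimal sketch of that convention, in Python, for anyone who wants to
see it concretely.  The function names encode_internal and decode_internal
are made up for illustration and are not taken from any particular
implementation.  The byte-level substitution is safe only because 0xC0
never appears in well-formed UTF-8 and 0x00 appears only as U+0000.

```python
def encode_internal(text: str) -> bytes:
    """Encode as UTF-8, but spell U+0000 as the overlong pair C0 80.

    Hypothetical helper: the result contains no 0x00 byte, so a single
    0x00 can serve as an in-program terminator.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_internal(data: bytes) -> str:
    """Undo the convention, then decode with a strict UTF-8 decoder."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


internal = encode_internal("a\x00b")
assert b"\x00" not in internal
assert decode_internal(internal) == "a\x00b"

# A strict decoder rejects the internal form if it leaks out unconverted.
try:
    internal.decode("utf-8")
except UnicodeDecodeError:
    print("strict decoder rejected C0 80")
```

The last few lines are the "sneak out into files" case: bytes written in
the internal form are not valid UTF-8 to a strict reader.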

> This is incredibly important, since any misguided
> attempt to "be liberal in what you accept" without addition of an
> explicit canonicalization step would lead to the kind of security
> holes that Microsoft web-related applications have been so full of...

There is something of a contradiction here:  the usual place to "be liberal
in what you accept" is a decoder, which has an inherent canonicalization
step.  In practice, I think the only place where you make a meaningful
*choice* as to whether to be liberal or not is when canonicalization is
imminent anyway.  Someone who is comparing the raw UTF-8 sequences is
making a mistake, yes, but it's not a "be liberal in what you accept"
mistake, it's an error in choice of working representation. 
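
To make the distinction concrete, here is a rough sketch in Python of a
deliberately liberal decoder.  The name decode_lenient is hypothetical, and
the code skips checks a real decoder would want (surrogates, the U+10FFFF
ceiling).  It accepts overlong forms, yet its output is code points, so
two spellings of the same character compare equal after decoding even
though the raw bytes differ.

```python
from typing import List


def decode_lenient(data: bytes) -> List[int]:
    """Decode UTF-8 to code points, accepting overlong sequences.

    Illustrative only: no checks for surrogates or values above U+10FFFF.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx
            cp, n = b, 0
        elif b & 0xE0 == 0xC0:       # 110xxxxx, one continuation byte
            cp, n = b & 0x1F, 1
        elif b & 0xF0 == 0xE0:       # 1110xxxx, two continuation bytes
            cp, n = b & 0x0F, 2
        elif b & 0xF8 == 0xF0:       # 11110xxx, three continuation bytes
            cp, n = b & 0x07, 3
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        tail = data[i + 1 : i + 1 + n]
        if len(tail) != n or any(t & 0xC0 != 0x80 for t in tail):
            raise ValueError(f"bad continuation at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        out.append(cp)
        i += 1 + n
    return out


shortest = b"K"                                  # 01001011
overlong = bytes([0b11000001, 0b10001011])       # the alternate spelling

assert shortest != overlong                      # raw comparison disagrees
assert decode_lenient(shortest) == decode_lenient(overlong) == [0x4B]
```

Any filter that inspects the raw bytes for "K" will miss the second
spelling; one that inspects the decoded code points will not.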

> The alternate spelling
>       11000001 10001011
> ... is not the character K <U+004B> but INVALID SEQUENCE.  One
> possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
> CHARACTER on encountering illegal sequences.

Unless you are Bill Gates and have the power to decree that your users
*will* use your preferred decoder, this may be a mistake.  Remember that
the users of a decoder see no advantage from this behavior, since they are
canonicalizing anyway.
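
For comparison, the strict behavior being proposed looks roughly like
this with Python's built-in UTF-8 codec, which rejects overlong forms.
The wrapper name decode_with_replacement is made up for illustration; the
errors="replace" handler is the standard one.

```python
def decode_with_replacement(data: bytes) -> str:
    """Strict UTF-8 decode that substitutes U+FFFD for illegal input."""
    return data.decode("utf-8", errors="replace")


overlong_k = bytes([0b11000001, 0b10001011])     # overlong spelling of "K"
result = decode_with_replacement(b"a" + overlong_k + b"b")

assert "K" not in result          # the overlong form is not accepted as K
assert "\ufffd" in result         # it is replaced instead
print(repr(result))
```

Whether a user is better served by this output or by the lenient one is
exactly the question at issue.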

                                                          Henry Spencer
                                                       [EMAIL PROTECTED]


-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/