On 12 Jul 2000, H. Peter Anvin wrote:
> > 1-byte characters 0xxxxxxx
> > 2-byte characters 110xxxxx 10xxxxxx
>
> ...It is also very important to realize that ONLY THE SHORTEST POSSIBLE
> SEQUENCE IS LEGAL...
It's worth noting that some implementations which generally hew pretty
close to this make one exception: they represent ASCII NUL (U+0000) as
11000000 10000000, so that 00000000 can be used as a terminator within
programs without worrying that it will collide with a user character.
This *is* a violation of the rules, and one would hope that it will stay
a program-internal convention, but it may well sneak out into files.
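A minimal sketch of that convention, in Python for brevity. The function name encode_nul_as_c0_80 is made up for illustration and is not taken from any particular implementation; everything except U+0000 goes through the ordinary UTF-8 encoder.

```python
def encode_nul_as_c0_80(text: str) -> bytes:
    """Encode text as UTF-8, except that U+0000 becomes the pair C0 80."""
    out = bytearray()
    for ch in text:
        if ch == "\x00":
            # 11000000 10000000: the non-shortest two-byte form of U+0000
            out += b"\xc0\x80"
        else:
            out += ch.encode("utf-8")
    return bytes(out)


encoded = encode_nul_as_c0_80("a\x00b")
print(encoded)            # b'a\xc0\x80b'
print(0x00 in encoded)    # False, so a 00000000 byte is free to act as a terminator
```

A strictly conforming decoder will refuse these bytes, which is exactly why the convention is only safe while it stays inside the program.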
> This is incredibly important, since any misguided
> attempt to "be liberal in what you accept" without addition of an
> explicit canonicalization step would lead to the kind of security
> holes that Microsoft web-related applications have been so full of...
There is something of a contradiction here: the usual place to "be liberal
in what you accept" is a decoder, and decoding is inherently a
canonicalization step. In practice, I think the only place where you make a
meaningful *choice* as to whether to be liberal or not is when
canonicalization is imminent anyway. Someone who is comparing the raw UTF-8
sequences is making a mistake, yes, but it's not a "be liberal in what you
accept" mistake; it's an error in choice of working representation.
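To make the point concrete, here is a rough sketch of a liberal decoder restricted to the 1-byte and 2-byte forms quoted at the top. The name decode_liberal is hypothetical, and a real decoder would of course handle the longer forms as well.

```python
def decode_liberal(data: bytes) -> str:
    """Decode 1- and 2-byte UTF-8 sequences, accepting non-shortest 2-byte forms."""
    chars = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            # 0xxxxxxx
            chars.append(chr(b0))
            i += 1
        elif (b0 & 0xE0) == 0xC0 and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            # 110xxxxx 10xxxxxx, whether or not it is the shortest form
            chars.append(chr(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported or malformed sequence at offset {i}")
    return "".join(chars)


shortest = b"K"
overlong = bytes([0b11000001, 0b10001011])

print(shortest == overlong)                                   # False: raw bytes differ
print(decode_liberal(shortest) == decode_liberal(overlong))   # True: same character
print(decode_liberal(overlong).encode("utf-8"))               # b'K'
```

Comparing the decoded characters gives the right answer however the input was spelled; comparing the raw bytes does not, and that is a problem of working representation rather than of liberality.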
> The alternate spelling
> 11000001 10001011
> ... is not the character K <U+004B> but INVALID SEQUENCE. One
> possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> CHARACTER on encountering illegal sequences.
Unless you are Bill Gates and have the power to decree that your users
*will* use your preferred decoder, rejecting such sequences may be a
mistake. Remember that the users of a decoder see no advantage from the
strictness, since they are canonicalizing anyway.
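For comparison, this is roughly what the strict policy looks like from the user's side. The sketch uses Python 3's built-in UTF-8 decoder, which rejects the alternate spelling of K; the variable names are only for illustration.

```python
overlong_k = bytes([0b11000001, 0b10001011])

# Default error handling: the alternate spelling is refused outright.
try:
    overlong_k.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# errors="replace": the decoder substitutes U+FFFD instead of raising.
replaced = overlong_k.decode("utf-8", errors="replace")

print(rejected)                      # True
print("K" in replaced)               # False
print(set(replaced) == {"\ufffd"})   # True
```

Either way the user gets an error or a substitution mark where a liberal decoder would have handed back a K.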
Henry Spencer
[EMAIL PROTECTED]
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/