On Mon, 2 Feb 2004 [EMAIL PROTECTED] wrote:
> >There is some sense in this. The same sort of slovenly implementation
> >which might treat 0xC0 0xAF (non-minimal encoding) as '/' sometimes but
> >not always, might well also treat 0xFD 0x80 0x80 0x80 0x80 0xAF (code
> >point far outside the Unicode range) as '/' sometimes but not always...
>
> These are two separate issues, and should not be tied together as such.
Perhaps not, but they were both outlawed in the same revision of the specs.
> Overcoded sequences are clearly invalid.
Actually, I'm one of the people who never thought this was "clear". Many
of our information coding schemes have multiple ways of representing the
same information.
> >Now, not everyone agrees that trying to fix *either* of these problems by
> >standards engineering was a sensible approach, but there is no doubt that
> >it *was* done and the current standards *do* call for it.
>
> Do you think that an I/O layer should check for high-surrogate codepoints
> encoded into UTF-8 and perform some arbitrary action on those?
If you carefully read the current official definition of UTF-8, you will
find that code points between 0xD800 and 0xDFFF -- the surrogates, both
low and high -- *cannot* be encoded in standard-conforming UTF-8. E.g.,
0xED 0xA0 0x80 is an ill-formed sequence, which a standard-conforming
decoder is required to reject. See Unicode 4.0, Table 3-6, "Well-Formed
UTF-8 Byte Sequences".
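For concreteness, here is a minimal sketch of a checker driven directly by that table. The byte ranges are transcribed from the well-formed-sequences table; the names `_TABLE` and `is_well_formed_utf8` are made up for this illustration, and this is one possible way to structure such a check, not anybody's official implementation.

```python
# Each entry: (lead_lo, lead_hi, [allowed (lo, hi) range for each trailing byte]).
# Anything not listed (0xC0, 0xC1, 0xF5..0xFF, stray 0x80..0xBF) is ill-formed.
_TABLE = [
    (0x00, 0x7F, []),
    (0xC2, 0xDF, [(0x80, 0xBF)]),
    (0xE0, 0xE0, [(0xA0, 0xBF), (0x80, 0xBF)]),   # excludes non-minimal forms
    (0xE1, 0xEC, [(0x80, 0xBF), (0x80, 0xBF)]),
    (0xED, 0xED, [(0x80, 0x9F), (0x80, 0xBF)]),   # excludes U+D800..U+DFFF
    (0xEE, 0xEF, [(0x80, 0xBF), (0x80, 0xBF)]),
    (0xF0, 0xF0, [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    (0xF1, 0xF3, [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    (0xF4, 0xF4, [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),  # caps at U+10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if every byte sequence in data matches the table."""
    i = 0
    while i < len(data):
        lead = data[i]
        for lo, hi, trail in _TABLE:
            if lo <= lead <= hi:
                break
        else:
            return False  # lead byte that can never start a valid sequence
        tail = data[i + 1 : i + 1 + len(trail)]
        if len(tail) != len(trail):
            return False  # sequence truncated at end of input
        for byte, (tlo, thi) in zip(tail, trail):
            if not tlo <= byte <= thi:
                return False
        i += 1 + len(trail)
    return True
```

With this, the non-minimal 0xC0 0xAF, the six-byte 0xFD 0x80 0x80 0x80 0x80 0xAF, and the surrogate 0xED 0xA0 0x80 are all rejected, while a plain '/' passes.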
Again, you or I might not *approve*, but there is no question of what the
UTF-8 spec now says. An implementation which does not conform to the spec
should not use a label which implies that it does.
> What about U00FFFF and U00FFFE...
I look at the standard, and it says "These codes are intended for process
internal uses, but are not permitted for interchange." In other words, if
you see those in input, something's wrong. How you respond to something
wrong in input is your call, but quietly ignoring it is rarely wise.
(Past experience among folks I know has generally been that substituting a
"something not kosher here" marker, such as U+FFFD, is preferable -- as a
*default* response -- to either interrupting processing or just passing
the bad input on.)
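A rough sketch of that default, using Python's built-in decoder. The function name `decode_with_markers` is invented here, and replacing U+FFFE and U+FFFF as well is my assumption about how one might apply the "not permitted for interchange" rule, since Python's own UTF-8 decoder passes those two through unchanged.

```python
REPLACEMENT = "\uFFFD"  # REPLACEMENT CHARACTER


def decode_with_markers(data: bytes) -> str:
    """Decode UTF-8, marking anything suspicious with U+FFFD."""
    # errors="replace" substitutes U+FFFD for ill-formed byte sequences
    # instead of raising an exception or passing the bytes along.
    text = data.decode("utf-8", errors="replace")
    # Assumption for this sketch: treat the noncharacters U+FFFE and U+FFFF
    # the same way, since they should not appear in interchanged text.
    return "".join(
        REPLACEMENT if ch in ("\uFFFE", "\uFFFF") else ch for ch in text
    )
```

How many markers a given ill-formed run produces is a decoder detail; the point is only that processing continues and the damage stays visible.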
> or a stray BOM in the middle of some text?
The spec is quite clear that U+FEFF showing up at a random location is a
"zero width no-break space", semantically equivalent to U+2060, although
this use of U+FEFF is now discouraged.
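As an illustration of what that could mean for a text-handling layer, here is a small sketch. The name `handle_bom` and the `replace_interior` option are hypothetical; dropping a leading U+FEFF as a signature and rewriting interior ones to U+2060 are policy choices assumed for the example, not something the spec requires.

```python
BOM = "\uFEFF"          # byte order mark / ZERO WIDTH NO-BREAK SPACE
WORD_JOINER = "\u2060"  # the character with the same semantics


def handle_bom(text: str, replace_interior: bool = False) -> str:
    """Drop a leading U+FEFF; leave or rewrite any that appear later."""
    # At the very start of the text, U+FEFF is treated as a signature.
    if text.startswith(BOM):
        text = text[1:]
    # Anywhere else it is an ordinary character; optionally rewrite it to
    # the semantically equivalent U+2060.
    if replace_interior:
        text = text.replace(BOM, WORD_JOINER)
    return text
```

By default an interior U+FEFF is passed through untouched, which matches reading it as a legitimate character, not an error.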
Henry Spencer
[EMAIL PROTECTED]
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/