There is some sense in this. The same sort of slovenly implementation
which might treat 0xC0 0xAF (non-minimal encoding) as '/' sometimes but
not always, might well also treat 0xFD 0x80 0x80 0x80 0x80 0xAF (code
point far outside the Unicode range) as '/' sometimes but not always.
If you think it is best to restrict the spec to fix the first problem (as
opposed to, say, shooting the incompetent programmer), restricting it
further to fix the second is also reasonable.
These are two separate issues, and they should not be tied together. Overlong (non-minimal) sequences are clearly invalid; "out-of-range" sequences are a different matter, as I argue below.
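Purely as an illustration (this sketch is mine, not taken from any existing implementation), the two failure modes look like this: a decoder that only strips the tag bits and concatenates the payload turns the overlong C0 AF into U+002F ('/') and turns the old 6-byte form FD 80 80 80 80 AF into 0x4000002F, far outside the Unicode range, while a strict decoder in the spirit of RFC 3629 rejects both.

#include <stdio.h>

/* Decode one UTF-8 sequence of the given length with no minimality or
 * range checks: just strip the tag bits and concatenate the payload. */
static long decode_sloppy(const unsigned char *s, int len)
{
    static const unsigned char tagmask[] = { 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
    long cp = s[0] & tagmask[len - 1];
    int i;
    for (i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}

/* Same payload extraction, but reject overlong forms, surrogates and
 * anything beyond U+10FFFF, as RFC 3629 requires.  Returns -1 on error. */
static long decode_strict(const unsigned char *s, int len)
{
    static const long minval[] = { 0x0, 0x80, 0x800, 0x10000 };
    long cp;
    if (len < 1 || len > 4)
        return -1;              /* 5- and 6-byte forms are no longer legal */
    cp = decode_sloppy(s, len);
    if (cp < minval[len - 1])
        return -1;              /* overlong (non-minimal) encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;              /* out of range or a surrogate */
    return cp;
}

int main(void)
{
    const unsigned char overlong[]   = { 0xC0, 0xAF };
    const unsigned char outofrange[] = { 0xFD, 0x80, 0x80, 0x80, 0x80, 0xAF };

    printf("sloppy: %lX %lX\n", (unsigned long)decode_sloppy(overlong, 2),
           (unsigned long)decode_sloppy(outofrange, 6));   /* sloppy: 2F 4000002F */
    printf("strict: %ld %ld\n", decode_strict(overlong, 2),
           decode_strict(outofrange, 6));                  /* strict: -1 -1 */
    return 0;
}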
Now, not everyone agrees that trying to fix *either* of these problems by
standards engineering was a sensible approach, but there is no doubt that
it *was* done and the current standards *do* call for it.
Do you think that an I/O layer should check for high-surrogate code points encoded into UTF-8 and perform some arbitrary action on those? What about U+FFFE and U+FFFF, or a stray BOM in the middle of some text?
I don't think it is the job of an I/O layer to make such decisions. Capping the code point range at U+10FFFF is no different from defining 0x110000-0x7FFFFFFF to be "not a character". I highly doubt that most existing implementations are truly as pedantic as the standard could be interpreted to call for, and doing code point filtering that early strikes me as splitting hairs at the wrong end.
Now, on the other hand, delivering a document marked as UTF-8 which contains any of the above-mentioned code points would be incorrect. A document is supposed to be a finished product, fully filtered and normalized.
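As a sketch of what that document-level filtering might look like (the function name is mine, nothing standard), the checks I would keep out of the I/O layer all fit naturally here:

#include <stdint.h>

/* Returns 1 if a code point is acceptable in a finished UTF-8 document,
 * 0 otherwise.  This is where I would reject surrogates, noncharacters
 * such as U+FFFE/U+FFFF, and a BOM that is not at the start of the text,
 * rather than in the I/O layer that merely decodes byte sequences. */
static int acceptable_in_document(uint32_t cp, int at_start_of_text)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)        /* UTF-16 surrogate code points */
        return 0;
    if ((cp & 0xFFFE) == 0xFFFE)             /* U+xxFFFE / U+xxFFFF noncharacters */
        return 0;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)        /* the other noncharacter block */
        return 0;
    if (cp == 0xFEFF && !at_start_of_text)   /* stray BOM in the middle of the text */
        return 0;
    return cp <= 0x10FFFF;                   /* and nothing beyond the Unicode range */
}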
