On Mon, 2 Feb 2004 [EMAIL PROTECTED] wrote:
> I personally think filtering the code-point range is a separate concern
> from encoding itself. I don't think you would want a UTF-32 input stream
> to start dropping words just because they exceed 0x10FFFF.
The problem is that UTF-8 is now explicitly defined (Unicode 4.0, pp. 77-78;
see also RFC 3629) as encoding only code points up to 0x10FFFF. As part
of the tightening of the encoding so that there is only one well-formed way
to encode any particular code point, the older forms which encoded higher
code points are now also explicitly deemed to be ill-formed, violations of
the spec.
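
For concreteness, the range check an encoder now has to make is just
the following (a sketch only; the function name is mine, not the
spec's):

    #include <stdint.h>

    /* Sketch: may cp be encoded in UTF-8 as currently defined?
     * RFC 3629 excludes the UTF-16 surrogates and everything
     * past 0x10FFFF. */
    static int utf8_encodable(uint32_t cp)
    {
        if (cp > 0x10FFFF)
            return 0;       /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;       /* surrogate, ill-formed in UTF-8 */
        return 1;
    }
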
There is some sense in this. The same sort of slovenly implementation
which might treat 0xC0 0xAF (non-minimal encoding) as '/' sometimes but
not always, might well also treat 0xFD 0x80 0x80 0x80 0x80 0xAF (code
point far outside the Unicode range) as '/' sometimes but not always.
If you think it is best to restrict the spec to fix the first problem (as
opposed to, say, shooting the incompetent programmer), restricting it
further to fix the second is also reasonable.
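
To make that concrete: under the old six-byte rules, 0xFD 0x80 0x80
0x80 0x80 0xAF decodes to 0x4000002F, whose low byte is 0x2F ('/'),
which is exactly what a truncating implementation would see. A strict
decoder per the current definition rejects both sequences outright
rather than leaving it to the implementer's mood. A sketch (again,
the function name is my own):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of a strict decoder: read one UTF-8 sequence from the
     * n bytes at s, store the code point in *out, and return the
     * sequence length, or -1 if the sequence is ill-formed.  Note
     * that 0xC0 0xAF dies on the lead-byte test (0xC0 is always
     * overlong), and 0xFD 0x80 0x80 0x80 0x80 0xAF dies because
     * 0xFD is no longer a legal lead byte at all. */
    static int utf8_decode(const unsigned char *s, size_t n,
                           uint32_t *out)
    {
        uint32_t cp;
        int len, i;

        if (n == 0)
            return -1;
        if (s[0] < 0x80)      { cp = s[0];        len = 1; }
        else if (s[0] < 0xC2) return -1; /* stray continuation,
                                          * or overlong 0xC0/0xC1 */
        else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
        else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
        else if (s[0] < 0xF5) { cp = s[0] & 0x07; len = 4; }
        else return -1;  /* 0xF5..0xFF would encode past 0x10FFFF */

        if ((size_t)len > n)
            return -1;
        for (i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80)
                return -1;          /* not a continuation byte */
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        /* remaining overlongs, surrogates, out-of-range values */
        if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000))
            return -1;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;
        *out = cp;
        return len;
    }
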
Now, not everyone agrees that trying to fix *either* of these problems by
standards engineering was a sensible approach, but there is no doubt that
it *was* done and the current standards *do* call for it.
An encoding which permits the higher-code-point forms can no longer be
properly spoken of as UTF-8. Given the modern definition of UTF-8, such
an encoding is an *extension* of UTF-8, and ought to be labeled as such to
avoid confusion. (The confusion inevitably caused by the change in the
meaning of "UTF-8" is unfortunate, but no longer avoidable. The best we
can do is to avoid adding to it.)
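
For concreteness, such an extension is essentially the original
scheme, which ran to six bytes and values up to 0x7FFFFFFF. A sketch
of its encoder, under a hypothetical name of my own choosing:

    #include <stdint.h>

    /* Sketch of the *extension*: encode any value up to 0x7FFFFFFF,
     * including the 5- and 6-byte sequences that UTF-8 proper no
     * longer permits.  Returns the number of bytes written to buf
     * (at most 6); assumes cp <= 0x7FFFFFFF. */
    static int utf8ext_encode(uint32_t cp, unsigned char *buf)
    {
        int len, i;

        if      (cp < 0x80)      len = 1;
        else if (cp < 0x800)     len = 2;
        else if (cp < 0x10000)   len = 3;
        else if (cp < 0x200000)  len = 4;
        else if (cp < 0x4000000) len = 5;  /* ill-formed in UTF-8 */
        else                     len = 6;  /* ill-formed in UTF-8 */

        if (len == 1) {
            buf[0] = (unsigned char)cp;
            return 1;
        }
        /* lead byte: len high bits set, then the top bits of cp */
        buf[0] = (unsigned char)((0xFF << (8 - len))
                                 | (cp >> (6 * (len - 1))));
        for (i = 1; i < len; i++)
            buf[i] = (unsigned char)
                (0x80 | ((cp >> (6 * (len - 1 - i))) & 0x3F));
        return len;
    }

Feeding it 0x4000002F yields exactly the 0xFD 0x80 0x80 0x80 0x80
0xAF sequence above, and a decoder like the sketch earlier refuses to
read it back; that incompatibility is precisely what the label should
warn about.
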
And the label for the extension preferably should not be too similar to
"UTF-8", again to avoid confusion. Call it UTF-P, or UTF-8P, or UTF-9,
but not "utf8", please.
Henry Spencer
[EMAIL PROTECTED]