Marcin 'Qrczak' Kowalczyk wrote on 2000-09-05 07:20 UTC:
> > http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>
> libiconv is better but sometimes returns more U+FFFD characters than
> recommended there.
As we discussed here before, both
- one U+FFFD per malformed sequence (as suggested in the above test file)
- one U+FFFD per byte of a malformed sequence
are practical and reasonable choices.
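To make the difference between the two policies concrete, here is a
minimal Python sketch. It is not libiconv's code and not taken from the
test file; the name decode_utf8 and the per_byte flag are made up for
illustration. It assumes sequences of at most four bytes and treats
truncated, overlong and out-of-range sequences as malformed (the
0x10FFFF limit is there only because Python's chr() cannot go higher):

```python
REPLACEMENT = "\ufffd"

def decode_utf8(data: bytes, per_byte: bool = False) -> str:
    """Decode UTF-8, replacing malformed input with U+FFFD.

    per_byte=False: one U+FFFD per malformed sequence.
    per_byte=True:  one U+FFFD per byte of a malformed sequence.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, minimum = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, minimum = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte or a lead byte this sketch does
            # not support: a malformed sequence of length one.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume up to `need` continuation bytes (10xxxxxx).
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        consumed = j - i
        complete = consumed == need + 1
        if complete and minimum <= cp <= 0x10FFFF:
            out.append(chr(cp))
        else:
            # Truncated, overlong or out of range.
            out.append(REPLACEMENT * (consumed if per_byte else 1))
        i = j
    return "".join(out)
```

For the overlong sequence C0 80, decode_utf8(b"a\xc0\x80b") yields one
U+FFFD between "a" and "b", while per_byte=True yields two.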
> > It now contains an additional section 5 with UTF-8 sequences for
> > illegal code positions that a good decoder should reject (surrogates,
> > U+FFFE, U+FFFF) like overlong and malformed sequences for security
> > reasons, as well as all the relevant legal boundary conditions
> > for these.
>
> Should they be rejected by decoders of other formats when applicable,
> e.g. U+FFFF in UTF-16 or surrogates in UCS-4?
That might be a good idea as well, because all of these can cause problems
in later stages of a processing pipeline:
- surrogates can lead to alternative representations of a Unicode
character if there is a UTF-16 decoder to follow somewhere
- U+FFFF is used internally by some applications with a special
meaning (e.g., WEOF = 0xffff on some systems with sizeof(wchar_t) == 2,
and also in xterm)
- U+FFFE is the anti-BOM, which might trigger a misplaced byte-swap
if found by a UTF-16 decoder later in the processing pipeline.
- ISO 10646-1 says explicitly that none of these should appear in
any encoding.
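A check along these lines could be shared by decoders of all formats.
The following Python sketch is only an illustration:
is_rejected_code_point and decode_ucs4_be are hypothetical names,
substituting U+FFFD is just one possible reaction, and the 0x10FFFF
bound is enforced merely because Python's chr() cannot go higher. The
last lines use Python's "surrogatepass" error handler to imitate a
lenient decoder and show the surrogate problem from the first item:

```python
def is_rejected_code_point(cp: int) -> bool:
    """True for surrogates, U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)

def decode_ucs4_be(data: bytes) -> str:
    """Decode big-endian UCS-4, replacing rejected values with U+FFFD."""
    if len(data) % 4 != 0:
        raise ValueError("UCS-4 input must be a multiple of 4 bytes")
    out = []
    for i in range(0, len(data), 4):
        cp = int.from_bytes(data[i:i + 4], "big")
        if is_rejected_code_point(cp) or cp > 0x10FFFF:
            out.append("\ufffd")
        else:
            out.append(chr(cp))
    return "".join(out)

# A lenient UTF-8 decoder lets UTF-8-encoded surrogates through ...
lenient = b"\xed\xa0\x80\xed\xb0\x80".decode("utf-8", "surrogatepass")
# ... and a later UTF-16 stage joins them into U+10000, which already
# has the regular UTF-8 representation F0 90 80 80.
utf16 = lenient.encode("utf-16-be", "surrogatepass")
alias = utf16.decode("utf-16-be")
regular = b"\xf0\x90\x80\x80".decode("utf-8")
```

Here alias and regular are the same character, so the pipeline has
accepted two different byte sequences for U+10000.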
I have not yet formed an opinion on whether characters > 0x10FFFF should
be rejected. Apparently ISO 10646-1:2000 deprecates these now because
they don't fit into UTF-16, but I haven't received a copy of the new ISO
standard yet and want to read the precise text first before I form an
opinion on that one.
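The arithmetic behind "they don't fit into UTF-16" can be sketched as
follows (Python again, with a hypothetical function name): a surrogate
pair carries 10 + 10 bits on top of an offset of 0x10000, so the
largest pair, DBFF DFFF, maps to 0x10FFFF and nothing above that is
reachable.

```python
def utf16_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Smallest and largest values that a surrogate pair can express.
lowest = utf16_pair_to_code_point(0xD800, 0xDC00)
highest = utf16_pair_to_code_point(0xDBFF, 0xDFFF)
```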
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/