Marcin 'Qrczak' Kowalczyk wrote on 2000-09-05 07:20 UTC:
> >  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
> 
> libiconv is better but sometimes returns more U+FFFD characters than
> recommended there.

As we discussed here before, both

  - one U+FFFD per malformed sequence (as suggested in the above test file)
  - one U+FFFD per byte of a malformed sequence

are practical and reasonable choices.
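
To make the difference between the two policies concrete, here is a
minimal sketch in Python. The function name decode_utf8_lenient and the
per_byte flag are made up for this illustration and are not taken from
libiconv or any other library. The sketch assumes sequences of at most
four bytes, and it treats overlong forms and values above 0x10FFFF as
malformed (the latter only because Python's chr() cannot represent
them). It does not check for surrogates or U+FFFE/U+FFFF.

```python
REPLACEMENT = "\ufffd"

def decode_utf8_lenient(data: bytes, per_byte: bool = False) -> str:
    """Decode UTF-8, substituting U+FFFD for malformed input.

    per_byte=False: one U+FFFD per malformed sequence.
    per_byte=True:  one U+FFFD per byte of a malformed sequence.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        # Lead byte determines how many continuation bytes are needed
        # and the smallest value that is not an overlong encoding.
        if 0xC0 <= b <= 0xDF:
            need, cp, min_cp = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, min_cp = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, min_cp = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte or 0xF8..0xFF (assumption: each
            # such byte is a malformed sequence of its own).
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume at most `need` continuation bytes (10xxxxxx).
        end = i + 1
        while (end < len(data) and end - i - 1 < need
               and (data[end] & 0xC0) == 0x80):
            cp = (cp << 6) | (data[end] & 0x3F)
            end += 1
        consumed = end - i
        complete = consumed == need + 1
        if complete and min_cp <= cp <= 0x10FFFF:
            out.append(chr(cp))
        else:
            out.append(REPLACEMENT * (consumed if per_byte else 1))
        i = end
    return "".join(out)
```

For the truncated sequence in b"a\xe2\x82b", the first policy yields
one U+FFFD between "a" and "b", the second yields two. Lone
continuation bytes give the same result under both policies, because
each one is a one-byte malformed sequence.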

> > It now contains an additional section 5 with UTF-8 sequences for
> > illegal code positions that a good decoder should reject (surrogates,
> > U+FFFE, U+FFFF) like overlong and malformed sequences for security
> > reasons, as well as all the relevant legal boundary conditions
> > for these.
> 
> Should they be rejected by decoders of other formats when applicable,
> e.g. U+FFFF in UTF-16 or surrogates in UCS-4?

That might be a good idea as well, because all these can cause problems
in later stages of a processing pipeline:

  - surrogates can lead to alternative representations of a Unicode
    character if there is a UTF-16 decoder to follow somewhere
  - U+FFFF is used internally by some applications with a special
    meaning (e.g., WEOF = 0xffff on some systems with sizeof(wchar_t) == 2,
    and also in xterm)
  - U+FFFE is the anti-BOM, which might trigger a misplaced byte-swap
    if found by a UTF-16 decoder later in the processing pipeline.
  - ISO 10646-1 says explicitly that none of these should appear in
    any encoding.
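
A format-independent check for these code positions could look roughly
like the following Python sketch. The names is_illegal_code_position
and sanitize are hypothetical, and substituting U+FFFD is only one
possible reaction; a decoder might equally well signal an error.

```python
def is_illegal_code_position(cp: int) -> bool:
    """True for the code positions discussed above:
    surrogates (U+D800..U+DFFF), U+FFFE and U+FFFF."""
    if 0xD800 <= cp <= 0xDFFF:
        return True
    return cp in (0xFFFE, 0xFFFF)

def sanitize(text: str, replacement: str = "\ufffd") -> str:
    """Replace every illegal code position in already decoded text."""
    return "".join(
        replacement if is_illegal_code_position(ord(ch)) else ch
        for ch in text
    )
```

The legal boundary cases U+D7FF, U+E000 and U+FFFD pass through
unchanged, while U+D800, U+DFFF, U+FFFE and U+FFFF are replaced.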

I have not yet formed an opinion on whether characters > 0x10FFFF should
be rejected. Apparently ISO 10646-1:2000 deprecates these now because
they don't fit into UTF-16, but I haven't received a copy of the new ISO
standard yet and want to read the precise text first before I form an
opinion on that one.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
