Re: utf-8 and well-formed but illegal chars

Markus Kuhn Sat, 11 Feb 2006 05:46:07 -0800

Bruno Haible wrote on 2006-01-19 11:50 UTC:
> Rich Felker wrote:
> > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > and trying to decide what to do with byte sequences that are
> > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > a good reference on the precedents?
> 
> The three cases are probably best treated separately:
> 
> - The range 0xd800-0xdfff. [...]
> - 0xfffe-0xffff [...]
> - The range >= 0x110000


When I wrote the UTF-8 validator routine

  http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c

after having given the issue some thought, I decided to reject all of
the above without distinguishing between them. Catching such problem
sequences as early as possible makes things much simpler and cleaner
should there ever be any UTF-16 conversion afterwards.

In particular:

  - 0xfffe could be misinterpreted by a later process as an anti-BOM, and
  - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
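
As a rough illustration of that policy (a minimal sketch, not the code
in utf8_check.c), the check on an already decoded code position could
look something like the following. The function name
ucs_is_acceptable() is hypothetical, and the sketch assumes that the
byte-level decoding, including the rejection of overlong forms, has
already been done elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from utf8_check.c.
   Returns 1 if the decoded code position cp is accepted,
   0 if it is rejected under the "reject all three cases" policy. */
int ucs_is_acceptable(uint32_t cp)
{
    if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;   /* UTF-16 surrogate range */
    if (cp == 0xfffe || cp == 0xffff)
        return 0;   /* byte-swapped BOM; wide EOF value if wchar_t is 16 bits */
    if (cp > 0x10ffff)
        return 0;   /* not representable in UTF-16 */
    return 1;
}
```

A caller would presumably report the same error (EILSEQ, say) for all
three cases, which is what keeps the logic simple.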

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
