Bruno Haible wrote on 2006-01-19 11:50 UTC:
> Rich Felker wrote:
> > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > and trying to decide what to do with byte sequences that are
> > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > a good reference on the precedents?
>
> The three cases are probably best treated separately:
>
> - The range 0xd800-0xdfff. [...]
> - 0xfffe-0xffff [...]
> - The range >= 0x110000

When I wrote the UTF-8 validator routine

  http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c

after having given the issue some thought, I decided to reject all of
the above without any further discrimination. Catching such problem
sequences as early as possible just makes things much simpler and
cleaner should there ever be any UTF-16 conversion afterwards.
In particular:

  - 0xfffe could be misinterpreted by a later process as an anti-BOM
    (a byte-swapped byte-order mark), and

  - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
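
For illustration, the "reject everything" policy described above can be sketched in C roughly as follows. This is not the code from utf8_check.c; the function names is_rejected_code_point() and decode_strict() are hypothetical and made up for this example. The sketch assumes the caller wants a single decoded code point or an error, and it also rejects overlong encodings, which goes beyond what the message itself discusses.

```c
#include <stddef.h>

/* Hypothetical helper (not from utf8_check.c): returns 1 if the code
 * point lies in one of the three ranges discussed in this thread. */
static int is_rejected_code_point(unsigned long cp)
{
    if (cp >= 0xd800UL && cp <= 0xdfffUL) return 1; /* UTF-16 surrogates */
    if (cp == 0xfffeUL || cp == 0xffffUL) return 1; /* anti-BOM, WEOF clash */
    if (cp >= 0x110000UL) return 1;                 /* beyond UTF-16 range */
    return 0;
}

/* Hypothetical strict decoder: decodes one sequence from s[0..n-1].
 * Returns the number of bytes consumed (1..4) and stores the code point
 * in *out, or returns -1 if the sequence is malformed, overlong, or
 * encodes one of the rejected code points. */
static int decode_strict(const unsigned char *s, size_t n, unsigned long *out)
{
    unsigned long cp, min;
    int len, i;

    if (n == 0) return -1;

    /* Classify the lead byte; min is the smallest code point that
     * legitimately needs a sequence of this length. */
    if (s[0] < 0x80)                { cp = s[0];        len = 1; min = 0; }
    else if ((s[0] & 0xe0) == 0xc0) { cp = s[0] & 0x1f; len = 2; min = 0x80; }
    else if ((s[0] & 0xf0) == 0xe0) { cp = s[0] & 0x0f; len = 3; min = 0x800; }
    else if ((s[0] & 0xf8) == 0xf0) { cp = s[0] & 0x07; len = 4; min = 0x10000; }
    else return -1;                 /* stray continuation or 5/6-byte lead */

    if ((size_t)len > n) return -1; /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80) return -1; /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3f);
    }

    if (cp < min) return -1;                   /* overlong encoding */
    if (is_rejected_code_point(cp)) return -1; /* caller may set EILSEQ */

    *out = cp;
    return len;
}
```

With this sketch, the bytes ED A0 80 (0xd800), EF BF BE (0xfffe), EF BF BF (0xffff) and F4 90 80 80 (0x110000) are all refused, while EF BF BD (0xfffd) and F4 8F BF BF (0x10ffff) are accepted.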