On Sat, Feb 11, 2006 at 01:44:39PM +0000, Markus Kuhn wrote:
> Bruno Haible wrote on 2006-01-19 11:50 UTC:
> > Rich Felker wrote:
> > > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > > and trying to decide what to do with byte sequences that are
> > > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > > a good reference on the precedents?

Since asking, I've been reading lots of standards material and talking
to several other people...

> > The three cases are probably best treated separately:
> >
> > - The range 0xd800-0xdfff. [...]
> > - 0xfffe-0xffff [...]
> > - The range >= 0x110000
>
> When I wrote the UTF-8 validator routine
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c

Wow, you should really try to cut down on the number of conditionals
there. It will kill performance due to mispredicted branches (with that
many branches I suspect you'll overflow the predictor's capacity). A
rough sketch of what I mean is at the end of this mail.

> after having given the issue some thought, I decided to reject all of
> the above without any further discrimination. It just makes things much
> simpler and cleaner should there ever be any UTF-16 conversion
> afterwards, if such problem sequences are caught as early as possible.

I tend to disagree. These code points (0xfffe and 0xffff) are valid
'Unicode scalar values' in the wording of the Unicode standard, and if
you wanted to forbid noncharacter code points there are several others
as well.

> In particular:
>
> - 0xfffe could be misinterpreted by a later process as an anti-BOM, and

This could be an issue on Windows/Java, but neither BOMs nor UTF-16 is
used on unix systems, for very good reason.

> - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.

Any implementation with sizeof(wchar_t)==2 is not ISO C compliant,
unless its adopted subset of UCS/Unicode is the 16-bit portion only
(i.e. not using UTF-16). The standard says that wchar_t represents a
whole character, and that there are no such things as multi-wchar_t
characters or shift states for wchar_t.

I agree that noncharacter code points should not be written to text
files for interchange with other applications; however, they are valid
for internal use, and must remain valid (and unchanged) under
conversions between the different encoding forms. Forbidding them
belongs at the level of the file writer in an application generating
files for interchange, not at the mbrtowc/wcrtomb level. The latter
could interfere with legitimate internal processing, and just slows
UTF-8 processing down even more.
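
To make that concrete, here is a rough, untested sketch of the kind of
table-driven decode step I have in mind; the function name and table
layout are made up for illustration, not taken from any existing
implementation. It looks the sequence length up from the lead byte
instead of walking a chain of conditionals, and it reflects the policy
I'm arguing for: overlong forms, surrogates, and anything above
0x10ffff are rejected, while noncharacters like 0xfffe/0xffff decode
normally and are left for a higher layer to police.

#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available). On success,
 * store the scalar value in *c and return the number of bytes
 * consumed; return 0 on error (invalid or truncated input -- a real
 * mbrtowc would distinguish the two). */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *c)
{
    /* bytes needed for each lead byte 0xC0..0xFF (0 = invalid lead;
     * 0xC0/0xC1 are always overlong and get caught by the min check) */
    static const unsigned char len[64] = {
        2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,   /* 0xC0..0xCF */
        2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,   /* 0xD0..0xDF */
        3,3,3,3,3,3,3,3, 3,3,3,3,3,3,3,3,   /* 0xE0..0xEF */
        4,4,4,4,4,0,0,0, 0,0,0,0,0,0,0,0,   /* 0xF0..0xF4, rest invalid */
    };
    /* smallest value that may legitimately use 2, 3, 4 bytes */
    static const uint32_t min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t v;
    size_t k, need;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                  /* ASCII fast path */
        *c = s[0];
        return 1;
    }
    if (s[0] < 0xC0)
        return 0;                       /* stray continuation byte */
    need = len[s[0] - 0xC0];
    if (need == 0 || need > n)
        return 0;
    v = s[0] & (0x7F >> need);          /* payload bits of the lead byte */
    for (k = 1; k < need; k++) {
        if ((s[k] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        v = (v << 6) | (s[k] & 0x3F);
    }
    if (v < min[need])
        return 0;                       /* overlong encoding */
    if (v > 0x10FFFF || v - 0xD800 < 0x800)
        return 0;                       /* out of range, or a surrogate
                                           (unsigned wrap makes this one
                                           compare, not two) */
    *c = v;                             /* 0xFFFE/0xFFFF pass through */
    return need;
}

The idea is just that the lookup tables replace most of the
per-lead-byte conditionals, so the remaining branches are few and
mostly predictable.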

Rich

P.S. Did your old thread about converting invalid bytes to high
surrogate codepoints (for binary-clean in-band error reporting) when
decoding UTF-8 ever reach a conclusion? I found part of the thread
while doing a search, but didn't see where it ended up.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/