On Sat, Feb 11, 2006 at 01:44:39PM +0000, Markus Kuhn wrote:
> Bruno Haible wrote on 2006-01-19 11:50 UTC:
> > Rich Felker wrote:
> > > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > > and trying to decide what to do with byte sequences that are
> > > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > > a good reference on the precedents?

Since asking, I've been reading lots of standards material and talking
to several other people...

> > The three cases are probably best treated separately:
> > 
> > - The range 0xd800-0xdfff. [...]
> > - 0xfffe-0xffff [...]
> > - The range >= 0x110000
> 
> When I wrote the UTF-8 validator routine
> 
>   http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c

Wow, you should really try to cut down on the number of conditionals
there. It will kill performance due to mispredicted branches (with
that many branches I suspect you'd exhaust the predictor anyway).
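For comparison, here's a rough sketch of how I'd structure it so that
the overlong/surrogate/range tests collapse into one comparison per
sequence. Just an illustration of the idea, not a drop-in replacement
for utf8_check.c, and note it deliberately accepts noncharacters like
U+FFFE/U+FFFF, per the argument further down:

/* Sketch: table-assisted UTF-8 validation with one combined range
 * check per sequence instead of per-case conditionals. */
#include <stddef.h>
#include <stdint.h>

static int utf8_valid(const unsigned char *s, size_t n)
{
    static const uint32_t minval[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;
    unsigned len, k;
    uint32_t cp;

    while (i < n) {
        unsigned char c = s[i];

        if (c < 0x80) { i++; continue; }              /* ASCII fast path */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return 0;                 /* stray continuation or 0xF8.. */

        if (n - i < len) return 0;     /* truncated sequence */

        for (k = 1; k < len; k++) {
            if ((s[i+k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i+k] & 0x3F);
        }

        /* one check: shortest form, no surrogates, <= U+10FFFF */
        if (cp < minval[len] || cp > 0x10FFFF || (cp - 0xD800) < 0x800)
            return 0;

        i += len;
    }
    return 1;
}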

> after having given the issue some thought, I decided to reject all of
> the above without any further discrimination. It just makes things much
> simpler and cleaner should there ever be any UTF-16 conversion
> afterwards, if such problem sequences are caught as early as possible.

I tend to disagree, at least for 0xfffe-0xffff. These code points are
valid 'Unicode scalar values' in the wording of the Unicode standard,
and if the goal were to forbid noncharacter code points, there are
plenty of others (U+FDD0-U+FDEF and the last two code points of every
plane) that this check would miss anyway.

> In particular:
> 
>   - 0xfffe could be misinterpreted by a later process as an anti-BOM, and

This could be an issue on Windows/Java, but neither BOMs nor UTF-16
are used on unix systems, for very good reason.

>   - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.

Any implementation with sizeof(wchar_t)==2 is not ISO C compliant,
unless its adopted subset of UCS/Unicode is limited to the 16-bit BMP
(i.e. it does not use UTF-16). The standard says that wchar_t
represents a whole character and that there are no such things as
multi-wchar_t characters or shift states for wchar_t.
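
For example, a conforming mbrtowc() has to hand back a character
outside the BMP as a single wchar_t, which a 16-bit wchar_t can't do.
A quick illustration (this assumes your environment locale is UTF-8
and a 32-bit wchar_t, as on glibc):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");                /* assumes a UTF-8 locale */
    const char *s = "\xF0\x9F\x82\xA1";     /* U+1F0A1, 4-byte sequence */
    wchar_t wc;
    mbstate_t st = {0};
    size_t r = mbrtowc(&wc, s, 4, &st);

    /* whole character in one call, one wchar_t; WEOF can't collide */
    printf("consumed %zu bytes, wc = U+%04lX, WEOF = %#lx\n",
           r, (unsigned long)wc, (unsigned long)WEOF);
    return 0;
}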

I agree that noncharacter code points should not be written to text
files for interchange with other applications; however, they are valid
for internal use, and must survive conversions between the different
encoding forms unchanged. Forbidding them belongs at the level of the
file writer in an application generating files for interchange, not at
the mbrtowc/wcrtomb level. A check at the latter level could interfere
with legitimate internal processing, and just slows UTF-8 processing
down even more.
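
Something like this hypothetical helper in an application's
file-writing path is where I'd put the check; the names and the exact
policy here are just mine for illustration:

#include <stdint.h>

static int is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF) ||          /* contiguous block */
           ((cp & 0xFFFE) == 0xFFFE && cp <= 0x10FFFF); /* U+nFFFE/U+nFFFF */
}

/* Interchange policy: refuse to emit noncharacters into files meant
 * for other applications; internal buffers may still carry them. */
static int ok_for_interchange(uint32_t cp)
{
    return cp <= 0x10FFFF &&
           !(cp >= 0xD800 && cp <= 0xDFFF) &&
           !is_noncharacter(cp);
}

That way wcrtomb() and friends stay fast and lossless, and only the
interchange path pays for the policy check.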

Rich


P.S. Did your old thread about converting invalid bytes to high
surrogate codepoints (for binary-clean in-band error reporting) when
decoding UTF-8 ever reach a conclusion? I found part of the thread
while doing a search, but didn't see where it ended up.
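
For reference, the scheme as I understood it was roughly the
following, though I may be misremembering which surrogate block was
used, so take the 0xDC00 offset here as a guess on my part:

#include <stdint.h>

#define ESCAPE_BASE 0xDC00u   /* assumed offset; the proposal may differ */

/* Decoding: an undecodable byte b becomes a reserved sentinel code
 * point, so the data stays round-trippable. */
static uint32_t escape_byte(unsigned char b)
{
    return ESCAPE_BASE + b;
}

/* Re-encoding: recognize the sentinel and emit the original byte. */
static int unescape_byte(uint32_t cp, unsigned char *b)
{
    if (cp >= ESCAPE_BASE && cp <= ESCAPE_BASE + 0xFF) {
        *b = (unsigned char)(cp - ESCAPE_BASE);
        return 1;
    }
    return 0;
}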

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/