On Thu, Jan 19, 2006 at 12:50:09PM +0100, Bruno Haible wrote:
> Rich Felker wrote:
> > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > and trying to decide what to do with byte sequences that are
> > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > a good reference on the precedents?
>
> The three cases are probably best treated separately:
>
> - The range 0xd800-0xdfff. You should catch and reject them as invalid when
> you are programming a conversion to UCS-2 or UTF-16, for example
> UTF-8 -> UTF-16
> or
> UCS-4 -> UTF-16
> Otherwise it becomes possible for malicious users to create non-BMP
> characters at a level of processing where earlier stages of processing
> did not see them.
>
> In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.
Thanks for the comments. Actually, you've convinced me that the
UTF-16 surrogates do always need to be treated as errors, since the
user may later submit the decoded UCS numbers ("UCS-4") to a buggy
UTF-16 implementation (or a pre-UTF-16 UCS-2 writer).
Moreover, there's nothing valid these characters can possibly mean.
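
For concreteness, here's roughly the check I have in mind at the end
of the decode step (just a sketch with an invented name, not the
actual decoder code):

    #include <errno.h>

    /* Reject scalar values in the UTF-16 surrogate range so they can
     * never be smuggled into a later UTF-16 (or old UCS-2) stage.
     * Returns 0 on success, -1 with errno set to EILSEQ otherwise. */
    static int reject_surrogate(unsigned long c)
    {
        if (c >= 0xd800 && c <= 0xdfff) {
            errno = EILSEQ;
            return -1;
        }
        return 0;
    }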
> - For the other two ranges, the advice is dictated merely by consistency.
>
> Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
> therefore there is no need to catch them.
I was thinking it would be good to reject them so that non-UTF-8 data
can be detected more reliably, but the sequences [ef bf be] and
[ef bf bf] are extremely unlikely in any other encoding as far as I
know, so checking for them is probably just a useless performance hit.
Good to know the precedent here.
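
(For anyone wondering where those bytes come from: U+FFFE is
1111111111111110 in binary, and packed into the normal 3-byte form
1110xxxx 10xxxxxx 10xxxxxx that gives 11101111 10111111 10111110,
i.e. ef bf be; U+FFFF likewise gives ef bf bf.)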
> The range >= 0x110000, I would catch and reject as invalid. Some time ago
> I had a crash in an application because the first level of processing
> rejected only values >= 0x80000000, with a reasonable error message, and
> later processing relied on valid Unicode and called abort() when a
> character code >= 0x110000 was seen. Making the first level as strict
> as the later one fixed this.
I agree. What's worse, someone may try to use UCS character numbers as
indices into a lookup table (a large table, but still a possibility...)
without checking that they're in range (assuming the decoder will only
output valid numbers), with disastrous results.
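
So the final range check on a decoded value will probably end up
looking something like this (again just a sketch, name invented):

    #include <errno.h>

    /* Reject decoded values that are UTF-16 surrogates or lie past
     * U+10FFFF, so callers indexing per-character tables can rely on
     * 0 <= c <= 0x10FFFF without re-checking. */
    static int decoded_ok(unsigned long c)
    {
        if ((c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff) {
            errno = EILSEQ;  /* treat it as an illegal sequence */
            return 0;
        }
        return 1;
    }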
Thanks again for your suggestions.
Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/