On Thu, Jan 19, 2006 at 12:50:09PM +0100, Bruno Haible wrote:
> Rich Felker wrote:
> > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > and trying to decide what to do with byte sequences that are
> > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > a good reference on the precedents?
>
> The three cases are probably best treated separately:
>
> - The range 0xd800-0xdfff. You should catch and reject them as invalid when
> you are programming a conversion to UCS-2 or UTF-16, for example
> UTF-8 -> UTF-16
> or
> UCS-4 -> UTF-16
> Otherwise it becomes possible for malicious users to create non-BMP
> characters at a level of processing where earlier stages of processing
> did not see them.
>
> In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.
Thanks for the comments. Actually, you've convinced me that the
UTF-16 surrogates do always need to be treated as errors, since the
user may later submit the decoded UCS numbers ("UCS-4") to a buggy
UTF-16 implementation (or a pre-UTF-16 UCS-2 writer).
Moreover, there's nothing valid these characters can possibly mean.
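
For concreteness, here's roughly the check I have in mind at the end
of the decode step (just a sketch with an invented name, not the
actual decoder code):

    #include <errno.h>

    /* Reject scalar values in the UTF-16 surrogate range so they can
     * never be smuggled into a later UTF-16 (or old UCS-2) stage.
     * Returns 0 on success, -1 with errno set to EILSEQ otherwise. */
    static int reject_surrogate(unsigned long c)
    {
        if (c >= 0xd800 && c <= 0xdfff) {
            errno = EILSEQ;
            return -1;
        }
        return 0;
    }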
> - For the other two ranges, the advice is dictated merely by consistency.
>
> Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
> therefore there is no need to catch them.
I was thinking it would be good to reject them so that non-UTF-8 data
can be detected more reliably, but the sequences [ef bf be] and
[ef bf bf] are extremely unlikely in any other encoding as far as I
know, so checking for them is probably just a useless performance hit.
Good to know the precedent here.
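
(For anyone wondering where those bytes come from: U+FFFE is
1111111111111110 in binary, and packed into the normal 3-byte form
1110xxxx 10xxxxxx 10xxxxxx that gives 11101111 10111111 10111110,
i.e. ef bf be; U+FFFF likewise gives ef bf bf.)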
> The range >= 0x110000, I would catch and reject as invalid. Some time ago
> I had a crash in an application because the first level of processing
> rejected only values >= 0x80000000, with a reasonable error message, and
> later processing relied on valid Unicode and called abort() when a
> character code >= 0x110000 was seen. Making the first level as strict
> as the later one fixed this.
I agree. What's worse, someone may try to use UCS character numbers as
indices into a lookup table (a large table, but still a possibility...)
without checking that they're in range (assuming the decoder will only
output valid numbers), with disastrous results.
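
So the final range check on a decoded value will probably end up
looking something like this (again just a sketch, name invented):

    #include <errno.h>

    /* Reject decoded values that are UTF-16 surrogates or lie past
     * U+10FFFF, so callers indexing per-character tables can rely on
     * 0 <= c <= 0x10FFFF without re-checking. */
    static int decoded_ok(unsigned long c)
    {
        if ((c >= 0xd800 && c <= 0xdfff) || c > 0x10ffff) {
            errno = EILSEQ;  /* treat it as an illegal sequence */
            return 0;
        }
        return 1;
    }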
Thanks again for your suggestions.
Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/