Re: utf-8 and well-formed but illegal chars

Bruno Haible Thu, 19 Jan 2006 05:29:53 -0800

Rich Felker wrote:
> hope this isn't too off-topic -- i'm working on a utf-8 implementation
> and trying to decide what to do with byte sequences that are
> well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> a good reference on the precedents?


The three cases are probably best treated separately:

- The range 0xd800-0xdfff. You should catch and reject them as invalid when
  you are programming a conversion to UCS-2 or UTF-16, for example
    UTF-8 -> UTF-16
  or
    UCS-4 -> UTF-16
  Otherwise a malicious user can smuggle in two surrogates that a later
  stage reads as a single non-BMP character, one that the earlier stages
  of processing never saw.

  In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.

- For the other two ranges, the advice is dictated merely by consistency
  with the other layers of software that will see the data.

  Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
  therefore there is no need to catch them.

  I would catch and reject the range >= 0x110000 as invalid. Some time ago
  I had a crash in an application because the first level of processing
  rejected only values >= 0x80000000, with a reasonable error message, while
  later processing relied on valid Unicode and called abort() when it saw a
  character code >= 0x110000. Making the first level as strict as the later
  one fixed this.
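
As a rough illustration of the three rules above, here is a minimal sketch
in C. The names check_code_point and enum target are made up for this
example and do not come from any existing library; the function assumes the
UTF-8 bytes have already been decoded into a 32-bit value.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical selector for the encoding being converted to. */
enum target { TARGET_UCS4, TARGET_UTF16 };

/*
 * Return 0 if the decoded value c may be passed on to the target
 * encoding, or EILSEQ if it should be rejected as invalid.
 */
static int check_code_point(uint32_t c, enum target t)
{
    /* Beyond the Unicode range: reject in every conversion, so that
       later stages that rely on valid Unicode never see such values. */
    if (c >= 0x110000)
        return EILSEQ;

    /* Surrogates: reject when producing UTF-16, where two of them in a
       row would be read back as one non-BMP character. */
    if (t == TARGET_UTF16 && c >= 0xd800 && c <= 0xdfff)
        return EILSEQ;

    /* 0xfffe-0xffff are deliberately let through, like unassigned
       characters. */
    return 0;
}
```

For example, the bytes ED A0 80 ED B0 80 decode to 0xd800 0xdc00. Copied
unchecked into UTF-16 they form the surrogate pair for U+10000; with the
check above, check_code_point(0xd800, TARGET_UTF16) returns EILSEQ instead.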

Bruno


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
