Re: utf-8 and well-formed but illegal chars

Rich Felker Fri, 20 Jan 2006 07:37:23 -0800

On Thu, Jan 19, 2006 at 12:50:09PM +0100, Bruno Haible wrote:
> Rich Felker wrote:
> > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > and trying to decide what to do with byte sequences that are
> > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > a good reference on the precedents?
> 
> The three cases are probably best treated separately:
> 
> - The range 0xd800-0xdfff. You should catch and reject them as invalid when
>   you are programming a conversion to UCS-2 or UTF-16, for example
>     UTF-8 -> UTF-16
>   or
>     UCS-4 -> UTF-16
>   Otherwise it becomes possible for malicious users to create non-BMP
>   characters at a level of processing where earlier stages of processing
>   did not see them.
> 
>   In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.
> 
> - For the other two ranges, the advice is dictated merely by consistency.
> 
>   Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
>   therefore there is no need to catch them.
> 
>   The range >= 0x110000, I would catch and reject as invalid. Some time ago
> [...]


To follow up in case anyone cares: the Unicode standard agrees with
what you've said, except that 0xd800-0xdfff should always be rejected,
not only when the target is UCS-2 or UTF-16:

----------------------------------------------------------------------
D28 Unicode scalar value: Any Unicode code point except high-surrogate
    and low-surrogate code points.
     * As a result of this definition, the set of Unicode scalar
       values consists of the ranges 0 to D7FF and E000 to 10FFFF,
       inclusive.

D29 A Unicode encoding form assigns each Unicode scalar value to a
    unique code unit sequence.
----------------------------------------------------------------------

The standard goes on to clarify that an encoding form maps ALL Unicode
scalar values to code unit sequences, including noncharacter code
points and unassigned code points, and that this mapping does not
include the UTF-16 surrogate range.
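
For what it's worth, here is roughly how I read that rule in C. This is
only a sketch, not code from any existing library: the names
is_scalar_value and utf8_decode_one are made up for illustration, and
returning 0 is my stand-in for whatever error reporting (e.g. EILSEQ)
the real interface would use. It accepts noncharacters such as 0xfffe
and rejects surrogates, anything above 0x10ffff, and overlong forms.

```c
#include <stddef.h>
#include <stdint.h>

/* D28: scalar values are 0..D7FF and E000..10FFFF, inclusive. */
static int is_scalar_value(uint32_t cp)
{
    return cp < 0xD800 || (cp >= 0xE000 && cp <= 0x10FFFF);
}

/* Hypothetical helper: decode one UTF-8 sequence from s[0..n).
 * Returns the number of bytes consumed (1..4) and stores the code
 * point in *out, or returns 0 if the input is empty, truncated,
 * malformed, overlong, or does not encode a Unicode scalar value. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t len, i;

    if (n == 0)
        return 0;

    /* The lead byte determines the sequence length and the initial bits. */
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else
        return 0;               /* stray continuation byte or 0xF8..0xFF */

    if (len > n)
        return 0;               /* truncated sequence */

    /* Each continuation byte must be 10xxxxxx and adds six bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min_cp[len])
        return 0;               /* overlong encoding */
    if (!is_scalar_value(cp))
        return 0;               /* surrogate, or above 0x10ffff */

    *out = cp;
    return len;
}
```

With this, the bytes ED A0 80 (0xd800) and F4 90 80 80 (0x110000) are
rejected, while EF BF BE (0xfffe) decodes as an ordinary 3-byte
sequence.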

Rich


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
