Markus Kuhn wrote on 3 September:

> It now contains an additional section 5 with UTF-8 sequences for illegal
> code positions that a good decoder should reject (surrogates, U+FFFE,
> U+FFFF)

and on 24 July:

> ISO 10646-1 ... says in section R.4 at least:
> 
>   NOTE 3 - Values of x in the range 0000D800 .. 0000DFFF are reserved for the
>   UTF-16 form and do not occur in UCS-4. The values 0000FFFE and 0000FFFF
>   also do not occur (see clause 8). The mappings of these code positions
>   in UTF-8 are undefined.
> 
> So all UTF-8 sequences that represent any of these UCS-4 values can and
> should be treated like malformed sequences. There are far more potential
> security loopholes to be suppressed here. For example, UTF-8 encoded
> U+FFFE must be suppressed in a decoder, because it could be used by an
> attacker to create havoc with a false BOM in some applications, and
> U+FFFF similarly might have special applications.

Please explain why a UTF-8 to UCS-4 decoder should do anything with
surrogates.

One of the beauties of UTF-8 is that converting it to and from UCS-4 is
so simple. We shouldn't make it more complicated than necessary.
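
To illustrate how little such a decoder needs, here is a minimal sketch
in Python. It assumes the original ISO 10646 form of UTF-8 with sequences
of up to six bytes; the function name and the table layout are made up
for this example. It rejects malformed and overlong sequences, but
deliberately passes D800..DFFF, FFFE and FFFF through unchanged.

```python
def decode_utf8_to_ucs4(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of UCS-4 values (hypothetical helper)."""
    # (lead-byte mask, lead-byte pattern, sequence length, smallest legal value)
    forms = [
        (0x80, 0x00, 1, 0x0),
        (0xE0, 0xC0, 2, 0x80),
        (0xF0, 0xE0, 3, 0x800),
        (0xF8, 0xF0, 4, 0x10000),
        (0xFC, 0xF8, 5, 0x200000),
        (0xFE, 0xFC, 6, 0x4000000),
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, length, minimum in forms:
            if (lead & mask) == pattern:
                break
        else:
            # 0x80..0xBF (stray continuation byte), 0xFE or 0xFF
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Payload bits of the lead byte are the ones outside the mask.
        value = lead & ~mask & 0xFF
        for b in data[i + 1:i + length]:
            if (b & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte in sequence at offset {i}")
            value = (value << 6) | (b & 0x3F)
        if value < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        out.append(value)
        i += length
    return out
```

With this sketch, the three bytes ED A0 80 simply decode to the UCS-4
value 0000D800; any policy about that value is left to the consumer.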

There _is_ a security problem if a UTF-8 sequence containing a high
surrogate followed by a low surrogate is converted to UTF-16 and then
interpreted as a single character in UTF-16. But this is a problem of
UTF-16, not of UTF-8 or UCS-4. Therefore a UTF-8 to UTF-16 converter
should reject values between 0xD800 and 0xDFFF, and a UCS-4 to UTF-16
converter as well, but a UTF-8 to UCS-4 converter need not.
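
A rough sketch of where the check would then live, again in Python and
with a made-up function name: the UCS-4 to UTF-16 step is the one that
refuses D800..DFFF. Without that check, the two UCS-4 values 0000D800
and 0000DC00 would produce exactly the same 16-bit units as the single
character U+10000.

```python
def ucs4_to_utf16(values: list[int]) -> list[int]:
    """Convert UCS-4 values to UTF-16 code units (hypothetical helper)."""
    units = []
    for v in values:
        if 0xD800 <= v <= 0xDFFF:
            # Reserved for UTF-16 itself; must not appear as input here.
            raise ValueError(f"surrogate value {v:#06x} in UCS-4 input")
        if 0 <= v < 0x10000:
            units.append(v)
        elif 0x10000 <= v <= 0x10FFFF:
            v -= 0x10000
            units.append(0xD800 | (v >> 10))    # first unit, D800..DBFF
            units.append(0xDC00 | (v & 0x3FF))  # second unit, DC00..DFFF
        else:
            raise ValueError(f"value {v:#x} not representable in UTF-16")
    return units
```

A UTF-8 to UTF-16 converter gets the same protection by decoding to
UCS-4 first and then going through a function like this one.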

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/