Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Alastair Houghton via Unicode Thu, 18 May 2017 01:00:50 -0700

On 18 May 2017, at 06:01, Richard Wordingham via Unicode <unicode@unicode.org> 
wrote:
> 
> On Thu, 18 May 2017 02:04:55 +0200
> Philippe Verdy via Unicode <unicode@unicode.org> wrote:
> 
>> I find it intriguing that the update intends to enforce the decoding
>> of the **shortest** sequences, but now wants to treat **maximal
>> sequences** as a single unit with arbitrary length. UTF-8 was
>> designed to work only with some state machines that would NEVER need
>> to parse more than 4 bytes.
> 
> If you look at the sample code in
> http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that
> it's working with 6-byte sequences.  It's the Unicode, as opposed to
> ISO 10646, version that has always been restricted to 4 bytes.


There are good reasons for restricting it to four-byte sequences, mind; doing
so increases the number of byte values that can never appear in valid UTF-8,
which makes it easier to distinguish UTF-8 from non-UTF-8 data.  I don’t think
anyone is proposing allowing 5-byte or 6-byte sequences.
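
To make the detection point concrete, here is a rough Python sketch.  The
names (INVALID_4BYTE, INVALID_6BYTE, has_impossible_byte, looks_like_utf8) are
mine, purely for illustration, and the 6-byte set assumes that shortest-form
encoding is still required, so 0xC0 and 0xC1 are excluded in both cases.

```python
# Byte values that can never occur in well-formed UTF-8 under the
# four-byte (Unicode) definition: 0xC0 and 0xC1 could only start overlong
# two-byte forms, and 0xF5..0xFF would either encode values above
# U+10FFFF or start 5-/6-byte sequences.
INVALID_4BYTE = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))

# Assumption for comparison: the older 6-byte scheme with shortest form
# still enforced.  Lead bytes up to 0xFD are usable there, so far fewer
# byte values are impossible.
INVALID_6BYTE = frozenset({0xC0, 0xC1, 0xFE, 0xFF})


def has_impossible_byte(data, invalid=INVALID_4BYTE):
    """Return True if data contains a byte that can never be valid."""
    return any(b in invalid for b in data)


def looks_like_utf8(data):
    """Full check: Python's strict decoder rejects ill-formed input."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    sample = b"\xf8\x88\x80\x80\x80"  # shaped like an old 5-byte sequence
    print(len(INVALID_4BYTE), len(INVALID_6BYTE))
    print(has_impossible_byte(sample))
    print(has_impossible_byte(sample, INVALID_6BYTE))
    print(looks_like_utf8(sample))
```

The cheap byte scan is only a first filter; a buffer that passes it can still
be ill-formed, which is why the sketch also includes the full decode check.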

Kind regards,

Alastair.

--
http://alastairs-place.net
