On 16 May 2017, at 08:22, Asmus Freytag via Unicode <unicode@unicode.org> wrote:

> I therefore think that Henri has a point when he's concerned about tacit 
> assumptions favoring one memory representation over another, but I think the 
> way he raises this point is needlessly antagonistic.

That would be true if the in-memory representation had any effect on what we’re 
talking about, but it really doesn’t.

(The only time I can think of where the in-memory representation has a 
significant effect is default binary ordering of string data: in the presence 
of non-BMP characters, UTF-8 and UCS-4 sort the same way, but because the 
surrogates sit “in the wrong place” in the code space, UTF-16 doesn’t.  I 
think everyone is well aware of that, no?)
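
To see that concretely, here’s a minimal sketch (Python, mine rather than 
anything from the thread; the variable names are purely illustrative):

  # U+FB01 is a BMP code point above the surrogate range; U+1D11E is non-BMP,
  # so UTF-16 encodes it as the surrogate pair D834 DD1E.
  bmp = "\uFB01"         # LATIN SMALL LIGATURE FI
  astral = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

  # Code point (UCS-4) order and UTF-8 byte order agree: U+FB01 < U+1D11E.
  assert bmp < astral
  assert bmp.encode("utf-8") < astral.encode("utf-8")

  # UTF-16 code unit order disagrees, because the lead surrogate 0xD834
  # sorts below 0xFB01.
  assert bmp.encode("utf-16-be") > astral.encode("utf-16-be")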

>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>> test with three major browsers that use UTF-16 internally and have
>> independent (of each other) implementations of UTF-8 decoding
>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>> Unicode standard away from that kind of interop needs *way* better
>> rationale than "feels right”.

In what sense is this “interop”?  Under what circumstances would it matter how 
many U+FFFDs you see?  If you’re about to mutter something about security, 
consider this: security code *should* refuse to compare strings that contain 
U+FFFD (or at least should never treat them as equal, even to themselves), 
because it has no way of knowing what data the U+FFFD replaced.
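
To make that concrete, here’s the kind of comparison policy I have in mind (a 
sketch of my own, in Python; the function name is purely illustrative):

  REPLACEMENT = "\uFFFD"

  def compare_for_security(a: str, b: str) -> bool:
      """Never treat strings containing U+FFFD as equal, even to themselves,
      because U+FFFD tells us nothing about what the original bytes were."""
      if REPLACEMENT in a or REPLACEMENT in b:
          return False
      return a == b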

Would you advocate replacing

  e0 80 80

with

  U+FFFD U+FFFD U+FFFD     (1)

rather than

  U+FFFD                   (2)?

It’s pretty clear what the intent of the encoder was there, I’d say: e0 80 80 
is an overlong encoding of U+0000.  While we certainly don’t want to decode it 
as a NUL (overlong encodings were the source of security bugs in the past, as 
I recall), I also don’t see the logic in insisting that it must be decoded to 
*three* code points when it clearly represented only one code point in the 
input.
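
As a sketch of the two policies (again my own illustration in Python, not code 
from any of the implementations under discussion):

  # E0 80 80 is an overlong encoding of U+0000: the lead byte E0 promises a
  # three-byte sequence, but 80 is not a valid second byte after E0.
  bogus = bytes([0xE0, 0x80, 0x80])

  # Behaviour (1): one U+FFFD per bogus byte / maximal subpart, i.e. three
  # replacement characters for this input.
  decoded_1 = "\uFFFD" * 3

  # Behaviour (2): a single U+FFFD for the whole ill-formed sequence, on the
  # grounds that it plainly came from one (mis)encoded code point.
  decoded_2 = "\uFFFD"

  # For reference, CPython (3.3 and later) follows the maximal-subpart
  # recommendation, which gives behaviour (1) here.
  assert bogus.decode("utf-8", errors="replace") == decoded_1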

This isn’t just a matter of “feels nicer”.  (1) is simply illogical behaviour, 
and since behaviours (1) and (2) are both clearly out there today, it makes 
sense to pick the more logical alternative as the official recommendation.

Kind regards,

Alastair.

--
http://alastairs-place.net

