On 16 May 2017, at 09:31, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
> 
> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
> <alast...@alastairs-place.net> wrote:
>> That would be true if the in-memory representation had any effect on what 
>> we’re talking about, but it really doesn’t.
> 
> If the internal representation is UTF-16 (or UTF-32), it is a likely
> design that there is a variable into which the scalar value of the
> current code point is accumulated during UTF-8 decoding.

That’s quite a likely design with a UTF-8 internal representation too; it’s 
just that you’d only decode during processing, as opposed to immediately at 
input.
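
That is, with a UTF-8 internal representation the accumulator simply moves 
from the input boundary into whatever walks the code points later.  A minimal 
sketch in Python, purely illustrative and assuming the buffer was validated 
when it came in:

    def code_points(buf):
        # Walk the scalar values of an already-validated UTF-8 buffer.
        # The accumulator exists only here, while something is actually
        # processing the string, not at input time.
        i = 0
        while i < len(buf):
            lead = buf[i]
            if lead < 0x80:
                scalar, length = lead, 1
            elif lead < 0xE0:
                scalar, length = lead & 0x1F, 2
            elif lead < 0xF0:
                scalar, length = lead & 0x0F, 3
            else:
                scalar, length = lead & 0x07, 4
            for k in range(1, length):
                scalar = (scalar << 6) | (buf[i + k] & 0x3F)
            yield scalar
            i += length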

> When the internal representation is UTF-8, only UTF-8 validation is
> needed, and it's natural to have a fail-fast validator, which *doesn't
> necessarily need such a scalar value accumulator at all*.

Sure.  But a state machine can still contain appropriate error states without 
needing an accumulator.  That the ones you care about currently don’t is 
readily apparent, but there’s nothing stopping them from doing so.
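
For instance (again Python, and again just a sketch of the shape rather than 
of any particular implementation), the error cases fall straight out of the 
byte-range classes in Table 3-7, and nowhere does a scalar value get built up:

    def first_error(buf):
        # Illustrative fail-fast validator: no scalar accumulator, just
        # the byte-range classes from Table 3-7 (the same information a
        # table-driven state machine would encode).  Returns the index of
        # the first offending byte (or of the end of a truncated
        # sequence), or None if the input is well-formed UTF-8.
        i, n = 0, len(buf)
        while i < n:
            lead = buf[i]
            if lead <= 0x7F:
                i += 1
                continue
            if 0xC2 <= lead <= 0xDF:
                tails = [(0x80, 0xBF)]
            elif lead == 0xE0:                    # rejects overlong 3-byte forms
                tails = [(0xA0, 0xBF), (0x80, 0xBF)]
            elif 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
                tails = [(0x80, 0xBF), (0x80, 0xBF)]
            elif lead == 0xED:                    # rejects encoded surrogates
                tails = [(0x80, 0x9F), (0x80, 0xBF)]
            elif lead == 0xF0:                    # rejects overlong 4-byte forms
                tails = [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
            elif 0xF1 <= lead <= 0xF3:
                tails = [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
            elif lead == 0xF4:                    # rejects anything above U+10FFFF
                tails = [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
            else:                                 # C0, C1, F5..FF can never appear
                return i
            for k, (lo, hi) in enumerate(tails, start=1):
                if i + k >= n or not lo <= buf[i + k] <= hi:
                    return i + k
            i += 1 + len(tails)
        return None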

I don’t see this as an argument about implementations, since it really makes 
very little difference to the implementation which approach is taken; in both 
internal representations, the question is whether you generate U+FFFD 
immediately on detection of the first incorrect *byte*, or whether you do so 
after reading a complete sequence.  UTF-8 sequences are bounded anyway, so it 
isn’t as if failing early gives you any significant performance benefit.
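
To make that concrete, take my own example of the ill-formed input 
[f0 80 80 41] (F0 must be followed by a byte in the range 90..BF):

    fail on the first incorrect byte:   <F0> <80> <80> -> U+FFFD U+FFFD U+FFFD, then <41> -> 'A'
    fail after the attempted sequence:  <F0 80 80>     -> U+FFFD,               then <41> -> 'A'

Either way the decoder resynchronises at the 'A'; the only question is how 
many replacement characters you emit along the way.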

>> In what sense is this “interop”?
> 
> In the sense that prominent independent implementations do the same
> externally observable thing.

The argument is, I think, that in this case the thing they are doing is the 
*wrong* thing.  That many of them do it would only count as an argument if 
there were some reason it was desirable for them to do it, and there doesn’t 
appear to be such a reason, unless you can think of something that hasn’t been 
mentioned thus far?  The only reason you’ve given, to date, is that they 
currently do that, so that should be the recommended behaviour.  That is little 
different from the argument (which nobody has actually deployed) that ICU 
currently does the other thing, so *that* should be the recommended behaviour; 
the only difference is that *you* care about browsers and don’t care about ICU, 
whereas you yourself suggested that some of us might be advocating this 
decision because we care about ICU and not about e.g. browsers.

I’ll add also that even among the implementations you cite, some of them permit 
surrogates in their UTF-8 input (i.e. they’re actually processing CESU-8, not 
UTF-8 at all).  Python 2, for example, accepts the sequence [ed a0 bd ed b8 80] 
and decodes it as U+1F600 via the surrogate pair; a true “fast fail” 
implementation that conformed literally to the recommendation, as you seem to 
want, should instead replace it with *six* U+FFFDs, one per maximal subpart, 
no?
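
To illustrate what I mean (sketched from memory, so treat the exact output as 
approximate):

    # Python 3 rejects the encoded surrogate pair outright:
    >>> b'\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8')
    UnicodeDecodeError: ...

    # With errors='replace' it substitutes one U+FFFD per maximal
    # subpart -- <ED>, <A0>, <BD>, <ED>, <B8>, <80> -- six in total:
    >>> b'\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8', 'replace')
    '\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'

    # Python 2's decoder, by contrast, happily accepts the two encoded
    # surrogates, which a narrow build then presents as U+1F600:
    >>> '\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8')
    u'\U0001f600'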

One additional note: the standard codifies this behaviour as a 
*recommendation*, not a requirement.

Kind regards,

Alastair.

--
http://alastairs-place.net

