On 31 May 2017, at 18:43, Shawn Steele via Unicode <unicode@unicode.org> wrote:

> It is unclear to me what the expected behavior would be for this corruption
> if, for example, there were merely a half dozen 0x80 in the middle of ASCII
> text? Is that garbage a single "character"? Perhaps because it's a
> consecutive string of bad bytes? Or should it be 6 characters since they're
> nonsense? Or maybe 2 characters because the maximum # of trail bytes we can
> have is 3?
It should be six U+FFFD characters, because 0x80 is not a lead byte.

Basically, the new proposal is that we should decode bytes that structurally
match UTF-8, and if the resulting encoding is illegal (because it’s over-long,
because it’s a surrogate, or because it’s over U+10FFFF), then the entire
sequence is replaced with a single U+FFFD. If, on the other hand, we get a
sequence that isn’t structurally valid UTF-8, we replace the maximal
*structurally* valid subpart with U+FFFD and continue. (There’s a rough
sketch of this in code at the end of this message.)

> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?

Then you get two U+FFFDs.

> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.

I’m not sure the whole “best practice” label has been much help here. Perhaps
we should change it to say “Suggested Handling”, to make quite clear that
filing a bug report against code that chooses some other option is
unnecessary.

> There also appears to be a special weight given to non-minimally-encoded
> sequences.

I don’t think that’s true, *although* it *is* true that UTF-8 decoders have
historically tended to allow such things, so one might assume that some
software out there is generating them for whatever reason.

There are also *deliberate* violations of the minimal-length encoding rule in
some cases, for instance encoding NUL as the over-long pair 0xC0 0x80 so that
it won’t terminate a C-style string. Yes, you may retort, that isn’t “valid
UTF-8”. Sure. It *is* useful, though, and it *is* in use. If a UTF-8 decoder
encounters such a thing, it’s more meaningful for whoever sees the output to
see a single U+FFFD representing the illegally encoded NUL than it is to see
two U+FFFDs, one for an invalid lead byte and another for an “unexpected”
trail byte.

Likewise, there are encoders that generate surrogates in UTF-8, which is, of
course, illegal, but *does* happen. Again, they can provide reasonable
justifications for their behaviour (typically they want the default binary
sort to work the same as for UTF-16, for some reason), and again, replacing a
single encoded surrogate with one U+FFFD rather than several is more helpful
to whoever or whatever ends up seeing it.

And, of course, there are encoders that are attempting to exploit security
flaws, which will very definitely generate these kinds of things.

> It would seem to me that none of these illegal sequences should appear in
> practice, so we have either:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
>   sequences, causing garbage (perhaps one of the above two coding errors).

I see no reason to suppose that the proposed behaviour would function any
less well in those cases.

> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real". And that shouldn't happen (it's a bad encoder after all).

Except some encoders *deliberately* use over-longs, and one would assume
that, since UTF-8 decoders historically allowed them, there will be data “in
the wild” in this form.
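To make the difference concrete, here is a minimal sketch of the proposed
policy in C. The function names (trail_count, decode_one) and the interface
are mine, purely for illustration; it assumes the structural pattern tops out
at three trail bytes, as discussed above, and is not anybody’s shipping
decoder:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define REPLACEMENT 0xFFFDu

    /* Trail bytes implied by a lead byte, or -1 if the byte can never
       begin a structurally valid sequence. */
    static int trail_count(uint8_t b) {
        if (b < 0x80) return 0;   /* ASCII                    */
        if (b < 0xC0) return -1;  /* lone trail byte          */
        if (b < 0xE0) return 1;   /* 110xxxxx                 */
        if (b < 0xF0) return 2;   /* 1110xxxx                 */
        if (b < 0xF8) return 3;   /* 11110xxx                 */
        return -1;                /* 0xF8..0xFF: never a lead */
    }

    /* Decode one unit starting at p (end exclusive); write the scalar
       value or U+FFFD to *out and return the bytes consumed (>= 1). */
    static size_t decode_one(const uint8_t *p, const uint8_t *end,
                             uint32_t *out) {
        int trails = trail_count(p[0]);
        if (trails < 0) {         /* impossible lead: one U+FFFD */
            *out = REPLACEMENT;
            return 1;
        }
        if (trails == 0) {        /* ASCII */
            *out = p[0];
            return 1;
        }
        uint32_t cp = p[0] & (0x3Fu >> trails);
        size_t i = 1;
        /* Consume the trail bytes the lead calls for.  If the pattern
           breaks early, the bytes consumed so far are the maximal
           structurally valid subpart: one U+FFFD, then resume at the
           byte that broke the pattern. */
        for (; i <= (size_t)trails; i++) {
            if (p + i >= end || (p[i] & 0xC0) != 0x80) {
                *out = REPLACEMENT;
                return i;
            }
            cp = (cp << 6) | (p[i] & 0x3Fu);
        }
        /* Structurally valid; now the semantic checks.  Over-long,
           surrogate or out-of-range sequences become a single U+FFFD
           covering the whole sequence. */
        static const uint32_t min_for[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_for[trails] ||              /* over-long    */
            (cp >= 0xD800 && cp <= 0xDFFF) ||    /* surrogate    */
            cp > 0x10FFFF)                       /* out of range */
            cp = REPLACEMENT;
        *out = cp;
        return i;
    }

    int main(void) {
        /* The cases from this thread:
             80 80     -- lone trail bytes: one U+FFFD each
             C2 C2     -- two lead bytes, no trails: two U+FFFDs
             C0 80     -- over-long NUL: a single U+FFFD
             ED A0 80  -- encoded surrogate D800: a single U+FFFD */
        const uint8_t input[] = { 'A', 0x80, 0x80, 0xC2, 0xC2,
                                  0xC0, 0x80, 0xED, 0xA0, 0x80, 'B' };
        const uint8_t *p = input, *end = input + sizeof input;
        while (p < end) {
            uint32_t cp;
            p += decode_one(p, end, &cp);
            printf("U+%04X ", (unsigned)cp);
        }
        printf("\n");
        return 0;
    }

So a run of six 0x80s gives six U+FFFDs, two bare lead bytes give two, but
the over-long NUL and the encoded surrogate each give exactly one. If I’ve
followed the current best-practice text correctly, it would instead give two
U+FFFDs for 0xC0 0x80 and three for 0xED 0xA0 0x80, because 0xC0 is never a
valid lead and 0xED 0xA0 cannot be completed to a valid sequence.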
> The other scenarios seem just as likely as (or, IMO, much more likely than)
> a badly designed encoder creating overlong sequences that appear to fit the
> UTF-8 pattern but aren't actually UTF-8.

I’m not sure I agree that flipped bits, lost bytes and extra bytes are more
likely than a “bad” encoder. Bad string manipulation is of course prevalent,
though; there’s no way around that.

> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.

This is true; *however*, the newly proposed behaviour is in no way inferior
to the old proposed behaviour in those cases. It’s just different.

Kind regards,

Alastair.

--
http://alastairs-place.net