On 31 May 2017, at 18:43, Shawn Steele via Unicode <unicode@unicode.org> wrote:

> It is unclear to me what the expected behavior would be for this corruption
> if, for example, there were merely a half dozen 0x80 in the middle of ASCII
> text? Is that garbage a single "character"? Perhaps because it's a
> consecutive string of bad bytes? Or should it be 6 characters since they're
> nonsense? Or maybe 2 characters because the maximum # of trail bytes we can
> have is 3?
It should be six U+FFFD characters, because 0x80 is not a lead byte.

Basically, the new proposal is that we should decode bytes that structurally
match UTF-8, and if the resulting encoding is illegal (because it’s over-long,
because it’s a surrogate, or because it’s over U+10FFFF), then the entire
sequence is replaced with a single U+FFFD. If, on the other hand, we get a
sequence that isn’t structurally valid UTF-8, we replace the maximal
*structurally* valid subpart with U+FFFD and continue. (There’s a rough
sketch of this in code at the end of this message.)

> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?

Then you get two U+FFFDs.

> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.

I’m not sure the whole “best practice” label has been much help here. Perhaps
we should change it to say “Suggested Handling”, to make quite clear that
filing a bug report against code that chooses some other option is
unnecessary.

> There also appears to be a special weight given to non-minimally-encoded
> sequences.

I don’t think that’s true, *although* it *is* true that UTF-8 decoders have
historically tended to allow such things, so one might assume that some
software out there is generating them for whatever reason.

There are also *deliberate* violations of the minimal-length encoding rule in
some cases, for instance encoding NUL as the over-long pair 0xC0 0x80 so that
it won’t terminate a C-style string. Yes, you may retort, that isn’t “valid
UTF-8”. Sure. It *is* useful, though, and it *is* in use. If a UTF-8 decoder
encounters such a thing, it’s more meaningful for whoever sees the output to
see a single U+FFFD representing the illegally encoded NUL than it is to see
two U+FFFDs, one for an invalid lead byte and another for an “unexpected”
trail byte.

Likewise, there are encoders that generate surrogates in UTF-8, which is, of
course, illegal, but *does* happen. Again, they can provide reasonable
justifications for their behaviour (typically they want the default binary
sort to work the same as for UTF-16, for some reason), and again, replacing a
single encoded surrogate with one U+FFFD rather than several is more helpful
to whoever or whatever ends up seeing it.

And, of course, there are encoders that are attempting to exploit security
flaws, which will very definitely generate these kinds of things.

> It would seem to me that none of these illegal sequences should appear in
> practice, so we have either:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
>   sequences, causing garbage (perhaps one of the above two coding errors).

I see no reason to suppose that the proposed behaviour would function any
less well in those cases.

> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real". And that shouldn't happen (it's a bad encoder after all).

Except some encoders *deliberately* use over-longs, and one would assume
that, since UTF-8 decoders historically allowed them, there will be data “in
the wild” in this form.
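To make the difference concrete, here is a minimal sketch of the proposed
policy in C. The function names (trail_count, decode_one) and the interface
are mine, purely for illustration; it assumes the structural pattern tops out
at three trail bytes, as discussed above, and is not anybody’s shipping
decoder:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define REPLACEMENT 0xFFFDu

    /* Trail bytes implied by a lead byte, or -1 if the byte can never
       begin a structurally valid sequence. */
    static int trail_count(uint8_t b) {
        if (b < 0x80) return 0;   /* ASCII                    */
        if (b < 0xC0) return -1;  /* lone trail byte          */
        if (b < 0xE0) return 1;   /* 110xxxxx                 */
        if (b < 0xF0) return 2;   /* 1110xxxx                 */
        if (b < 0xF8) return 3;   /* 11110xxx                 */
        return -1;                /* 0xF8..0xFF: never a lead */
    }

    /* Decode one unit starting at p (end exclusive); write the scalar
       value or U+FFFD to *out and return the bytes consumed (>= 1). */
    static size_t decode_one(const uint8_t *p, const uint8_t *end,
                             uint32_t *out) {
        int trails = trail_count(p[0]);
        if (trails < 0) {         /* impossible lead: one U+FFFD */
            *out = REPLACEMENT;
            return 1;
        }
        if (trails == 0) {        /* ASCII */
            *out = p[0];
            return 1;
        }
        uint32_t cp = p[0] & (0x3Fu >> trails);
        size_t i = 1;
        /* Consume the trail bytes the lead calls for.  If the pattern
           breaks early, the bytes consumed so far are the maximal
           structurally valid subpart: one U+FFFD, then resume at the
           byte that broke the pattern. */
        for (; i <= (size_t)trails; i++) {
            if (p + i >= end || (p[i] & 0xC0) != 0x80) {
                *out = REPLACEMENT;
                return i;
            }
            cp = (cp << 6) | (p[i] & 0x3Fu);
        }
        /* Structurally valid; now the semantic checks.  Over-long,
           surrogate or out-of-range sequences become a single U+FFFD
           covering the whole sequence. */
        static const uint32_t min_for[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_for[trails] ||              /* over-long    */
            (cp >= 0xD800 && cp <= 0xDFFF) ||    /* surrogate    */
            cp > 0x10FFFF)                       /* out of range */
            cp = REPLACEMENT;
        *out = cp;
        return i;
    }

    int main(void) {
        /* The cases from this thread:
             80 80     -- lone trail bytes: one U+FFFD each
             C2 C2     -- two lead bytes, no trails: two U+FFFDs
             C0 80     -- over-long NUL: a single U+FFFD
             ED A0 80  -- encoded surrogate D800: a single U+FFFD */
        const uint8_t input[] = { 'A', 0x80, 0x80, 0xC2, 0xC2,
                                  0xC0, 0x80, 0xED, 0xA0, 0x80, 'B' };
        const uint8_t *p = input, *end = input + sizeof input;
        while (p < end) {
            uint32_t cp;
            p += decode_one(p, end, &cp);
            printf("U+%04X ", (unsigned)cp);
        }
        printf("\n");
        return 0;
    }

So a run of six 0x80s gives six U+FFFDs, two bare lead bytes give two, but
the over-long NUL and the encoded surrogate each give exactly one. If I’ve
followed the current best-practice text correctly, it would instead give two
U+FFFDs for 0xC0 0x80 and three for 0xED 0xA0 0x80, because 0xC0 is never a
valid lead and 0xED 0xA0 cannot be completed to a valid sequence.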
> The other scenarios seem just as likely as (or, IMO, much more likely than)
> a badly designed encoder creating overlong sequences that appear to fit the
> UTF-8 pattern but aren't actually UTF-8.

I’m not sure I agree that flipped bits, lost bytes and extra bytes are more
likely than a “bad” encoder. Bad string manipulation is of course prevalent,
though; there’s no way around that.

> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.

This is true; *however*, the newly proposed behaviour is in no way inferior
to the old proposed behaviour in those cases. It’s just different.

Kind regards,

Alastair.

--
http://alastairs-place.net