> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
> (again, both reasonable). However, I’m not entirely certain about things
> like
>
>     e0 e0 c3 89
>
> which the proposal would appear to decode as
>
>     U+FFFD U+FFFD U+FFFD U+FFFD (3)
>
> instead of a perhaps more reasonable
>
>     U+FFFD U+FFFD U+00C9 (4)
>
> (the key part is the “without ever restricting trail bytes to less than
> 80..BF”)

I agree with that as well, because of random access into strings: if you land on byte 0x89, you can assume it is a trailing byte and look backward, where you will find 0xC3 0x89, which decodes correctly as U+00C9 with no error detected. So the only wrong bytes are the two initial occurrences of 0xE0, each of which is converted to U+FFFD on its own. In summary: when you detect an ill-formed sequence, replace only its first code unit with U+FFFD and restart scanning from the next code unit, without skipping over multiple bytes. Emitting multiple U+FFFD is therefore not only the best practice; it also matches the intended design of UTF-8, which allows access from random positions.
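
To make the rule concrete, here is a minimal Python sketch of it, not a reference implementation. The names expected_length, decode_one_unit_at_a_time and sequence_start are made up for this illustration, and the sketch assumes that Python's strict "utf-8" codec is an acceptable test of whether a candidate sequence is well-formed. On failure it always replaces exactly one byte, which is deliberately different from handlers that replace a whole truncated prefix with a single U+FFFD.

```python
REPLACEMENT = "\ufffd"

def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # trailing byte (80..BF) or a byte that is never valid

def decode_one_unit_at_a_time(data: bytes) -> str:
    """Decode UTF-8; on any ill-formed sequence replace only its first byte."""
    out = []
    i = 0
    while i < len(data):
        length = expected_length(data[i])
        char = None
        if length:
            try:
                # Strict decoding rejects truncated, overlong and surrogate forms.
                char = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                char = None
        if char is None:
            out.append(REPLACEMENT)
            i += 1          # restart scanning from the very next code unit
        else:
            out.append(char)
            i += length
    return "".join(out)

def sequence_start(data: bytes, pos: int) -> int:
    """From a random position, back up over at most three trailing bytes."""
    steps = 0
    while pos > 0 and steps < 3 and 0x80 <= data[pos] <= 0xBF:
        pos -= 1
        steps += 1
    return pos

sample = bytes.fromhex("e0e0c389")
print([f"U+{ord(c):04X}" for c in decode_one_unit_at_a_time(sample)])
# ['U+FFFD', 'U+FFFD', 'U+00C9']
print(sequence_start(sample, 3))
# 2
```

With this policy a forward scan from byte 0 and a scan entered at byte 0x89 (after backing up to 0xC3) both see U+00C9 at the same place, which is the point of resynchronising one code unit at a time.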