RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Shawn Steele via Unicode Mon, 15 May 2017 13:12:07 -0700

>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
>> multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not 
> be undertaken lightly.


IMO, the only thing that can be agreed upon is that "something's bad with this 
UTF-8 data".  I think that whether it's treated as a single group of corrupt 
bytes, or each individual byte is considered its own problem, should be up to 
the implementation.
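
To make the two treatments concrete, here is a rough Python sketch.  It is an
illustration only, not taken from any real decoder: the name decode_lenient
and its collapse flag are hypothetical, and neither mode is the "maximal
subpart" practice discussed in the proposal.  With collapse=False every
offending byte gets its own U+FFFD; with collapse=True a run of adjacent
offending bytes is reported as one.

```python
REPLACEMENT = "\ufffd"

def decode_lenient(data: bytes, collapse: bool) -> str:
    """Decode UTF-8, replacing ill-formed bytes with U+FFFD.

    Hypothetical helper for illustration:
    collapse=False -> one U+FFFD per ill-formed byte
    collapse=True  -> one U+FFFD per run of adjacent ill-formed bytes
    """
    out = []
    i = 0
    in_error = False  # True while we are inside a run of bad bytes
    while i < len(data):
        # A well-formed sequence is 1 to 4 bytes long; try each length.
        for n in (1, 2, 3, 4):
            try:
                out.append(data[i:i + n].decode("utf-8"))
            except UnicodeDecodeError:
                continue
            i += n
            in_error = False
            break
        else:
            # No well-formed sequence starts here: treat this byte as bad.
            if not (collapse and in_error):
                out.append(REPLACEMENT)
            in_error = True
            i += 1
    return "".join(out)

# The over-long encoding C0 AF, between two ASCII letters.
overlong = b"A\xc0\xafB"
per_byte = decode_lenient(overlong, collapse=False)   # "A\ufffd\ufffdB"
grouped = decode_lenient(overlong, collapse=True)     # "A\ufffdB"
```

Both results tell the caller the same thing: the input was not well-formed
UTF-8.  They differ only in how many replacement characters mark the spot.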

#1 - This data should "never happen".  In a system behaving normally, this 
condition should never be encountered.  
  * At this point the data is "bad" and all bets are off.
  * Some applications may have a clue how the bad data could have happened and 
want to do something in particular.
  * It seems odd to me to spend much effort standardizing a scenario that 
should be impossible.
#2 - Depending on the implementation, either behavior, or some combination, may 
be more efficient.  I'd rather allow apps to optimize for the common case, not 
the case-that-shouldn't-ever-happen.
#3 - We have no clue if this "maximal" sequence was a single error, 2 errors, 
or even more.  The lead byte says how many trail bytes should follow, and those 
should be in a certain range.  Values outside of those conditions are illegal, 
so we shouldn't ever encounter them.  So if we did, then something really weird 
happened.  
  * Did a single character get misencoded?
  * Was an illegal sequence illegally encoded?
  * Perhaps a byte got corrupted in transmission?
  * Maybe we dropped a packet/block, so this is really the beginning of a valid 
sequence and the tail of another completely valid sequence?
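
The lead-byte and trail-byte conditions mentioned in #3 can be sketched as
follows.  This is a minimal illustration based on the well-formed UTF-8 byte
ranges in the Unicode Standard; the function names are hypothetical.

```python
def expected_length(lead: int) -> int:
    """Total sequence length implied by a lead byte, or 0 if the byte
    can never start a well-formed sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # 0x80..0xC1 (trail bytes, over-long leads) and 0xF5..0xFF

def trail_range(lead: int, index: int) -> tuple:
    """Allowed (low, high) range for the trail byte at 1-based position
    index after the given lead byte."""
    if index == 1:
        if lead == 0xE0:
            return (0xA0, 0xBF)  # excludes over-long 3-byte forms
        if lead == 0xED:
            return (0x80, 0x9F)  # excludes surrogates
        if lead == 0xF0:
            return (0x90, 0xBF)  # excludes over-long 4-byte forms
        if lead == 0xF4:
            return (0x80, 0x8F)  # excludes values above U+10FFFF
    return (0x80, 0xBF)

def is_well_formed_sequence(seq: bytes) -> bool:
    """True if seq is exactly one well-formed UTF-8 sequence."""
    if not seq:
        return False
    n = expected_length(seq[0])
    if n == 0 or len(seq) != n:
        return False
    for k in range(1, n):
        low, high = trail_range(seq[0], k)
        if not low <= seq[k] <= high:
            return False
    return True
```

The dropped-block case shows why the cause can't be recovered from the bytes
alone.  Two euro signs are E2 82 AC E2 82 AC; if the middle two bytes are
lost, what remains is E2 82 82 AC.  The first three bytes pass the check
above (they encode U+2082), so only the final AC looks wrong, even though the
damage actually spans two characters.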

In practice, all that most apps would be able to do would be to say "You have 
bad data, how bad I have no clue, but it's not right".  A single bit could've 
flipped, or you could have only 3 pages of a 4000 page document.  No clue at 
all.  At that point it doesn't really matter how many U+FFFDs the error(s) are 
replaced with, and no assumptions should be made about the severity of the 
error.
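
For an app that only needs the "you have bad data" answer, the count of
replacement characters never enters into it.  A minimal sketch, using
Python's built-in strict decoding (the helper name has_ill_formed_utf8 is
hypothetical):

```python
def has_ill_formed_utf8(data: bytes) -> bool:
    """Report whether data contains any ill-formed UTF-8, without
    saying how much of it is bad or how many errors there were."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False
```

The answer is the same whether one bit flipped or most of the document is
missing, which is the point: the check says nothing about severity.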

-Shawn
