Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Asmus Freytag via Unicode Mon, 15 May 2017 10:59:23 -0700

On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:

On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode@unicode.org> wrote:

In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf


I think Unicode should not adopt the proposed change.

Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
multiple errors there makes no sense.

Changing a specification as fundamental as this is something that should not be undertaken lightly.

Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats.

Even if it were true that all data is only stored in UTF-8, any data converted from UTF-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used.

Implementations working in UTF-8 natively would potentially see three formats:

1) the original ill-formed data
2) data converted with single FFFD
3) data converted with multiple FFFD

These forms cannot be compared for equality by binary matching.
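
To make the three forms concrete, here is a minimal sketch. The input bytes are a made-up example (an over-long three-byte encoding of U+0000 between "a" and "b"), and the "multiple" form assumes a converter that emits one U+FFFD per offending byte; neither is taken from real data.

```python
# Hypothetical example: "a", an over-long encoding of U+0000 (E0 80 80), "b".
ill_formed = b"a\xe0\x80\x80b"

# Form 2: a converter that replaces the whole sequence with one U+FFFD.
single_fffd = "a\ufffdb".encode("utf-8")

# Form 3: a converter that emits one U+FFFD per byte (assumed behaviour).
multiple_fffd = "a\ufffd\ufffd\ufffdb".encode("utf-8")

forms = [ill_formed, single_fffd, multiple_fffd]

# Binary matching sees three unrelated byte strings.
distinct_count = len(set(forms))
print(distinct_count)
```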

The best that can be done is to convert (1) into one of the other forms and then compare, treating any run of FFFD code points as equal to any other run, irrespective of length. (For security-critical applications, the presence of any FFFD should render the data invalid, so the comparisons we'd be talking about here would be for general-purpose uses, like search.)
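
A rough sketch of that comparison, assuming Python's built-in UTF-8 decoder as the converter for form (1). The names search_key and strict_decode are made up for illustration. Collapsing runs also merges adjacent but separate errors, which is exactly the loss of precision accepted for search-style matching.

```python
import re

FFFD_RUN = re.compile("\ufffd+")


def search_key(data: bytes) -> str:
    """Decode leniently, then collapse every run of U+FFFD to a single one."""
    text = data.decode("utf-8", errors="replace")
    return FFFD_RUN.sub("\ufffd", text)


def strict_decode(data: bytes) -> str:
    """For security-critical use: reject ill-formed input and any U+FFFD."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError if ill-formed
    if "\ufffd" in text:
        raise ValueError("input contains U+FFFD")
    return text


# Hypothetical samples of the three forms of the same text.
samples = [
    b"a\xe0\x80\x80b",                        # original ill-formed data
    "a\ufffdb".encode("utf-8"),               # converted with a single FFFD
    "a\ufffd\ufffd\ufffdb".encode("utf-8"),   # converted with multiple FFFD
]
keys = {search_key(s) for s in samples}
print(len(keys) == 1)
```

With the keys collapsed this way, all three samples compare equal, while strict_decode refuses each of them.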

Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late.

A./

ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes concerns of such implementations very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.  The fact is that UTF-16 
makes sense as a default encoding in many cases.  Yes, UTF-8 is more efficient 
for primarily ASCII text, but that is not the case for other situations and the 
fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 
usually focus on) is no more complicated than handling combining characters, 
which you have to do anyway.

Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.

I don’t think the Unicode Consortium should be unsympathetic to people who use 
UTF-8 internally, for sure, but I don’t see what that has to do with either the 
original proposal or with your criticism of UTF-16.

[snip]

If the proposed
change was adopted, while Draconian decoders (that fail upon first
error) could retain their current state machine, implementations that
emit U+FFFD for errors and continue would have to add more state
machine states (i.e. more complexity) to consolidate more input bytes
into a single U+FFFD even after a valid sequence is obviously
impossible.

“Impossible”?  Why?  You just need to add some error states (or *an* error 
state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the 
only library that already did just that *because it’s clearly the right thing 
to do*.
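
For illustration only, here is one way the "error state and a counter" idea could look. This is a sketch under assumptions, not ICU's code and not necessarily the exact behaviour the proposal specifies: the function name decode_single_fffd is made up, the sequence length is taken from the lead byte's bit pattern, and stray continuation bytes or 0xF8..0xFF each yield their own U+FFFD.

```python
REPLACEMENT = "\ufffd"


def decode_single_fffd(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per structurally delimited bad sequence."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            i += 1
            continue
        # Length implied by the lead byte: trailing bytes, payload, minimum value.
        if 0xC0 <= b <= 0xDF:
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, minimum = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, minimum = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte or 0xF8..0xFF: one error per byte (assumption).
            out.append(REPLACEMENT)
            i += 1
            continue
        i += 1
        seen = 0                          # the counter: trailing bytes consumed
        while seen < need and i < n and (data[i] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            seen += 1
        valid = (
            seen == need                  # not truncated
            and minimum <= cp <= 0x10FFFF # not over-long, not out of range
            and not 0xD800 <= cp <= 0xDFFF  # not a surrogate
        )
        out.append(chr(cp) if valid else REPLACEMENT)
    return "".join(out)


overlong = decode_single_fffd(b"a\xe0\x80\x80b")
print(overlong == "a\ufffdb")
```

The extra cost over a decoder that stops at the first bad byte is the inner loop that keeps consuming trailing bytes after the sequence can no longer be valid, which is the added state being argued about above.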

Kind regards,

Alastair.

--
http://alastairs-place.net
