On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.
Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting 
multiple errors there makes no sense.

Changing a specification as fundamental as this is something that should not be undertaken lightly.

Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats.

Even if it were true that all data is stored only in UTF-8, any data converted from UTF-8 back to UTF-8 through an interim stage that requires conversion would then differ depending on which converter was used.

Implementations working in UTF-8 natively would potentially see three formats:
1) the original ill-formed data
2) data converted with single FFFD
3) data converted with multiple FFFD

These forms cannot be compared for equality by binary matching.
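The three forms can be made concrete with the overlong sequence E0 80 80 as a worked example (decoder behaviours hedged: a single-error converter emits one replacement character for the whole sequence, while a maximal-subpart converter emits one per byte here, since E0 80 is not a prefix of any well-formed sequence):

```python
ill_formed = b"\xe0\x80\x80"                 # (1) the original ill-formed data
single     = "\uFFFD".encode("utf-8")        # (2) one U+FFFD  -> b'\xef\xbf\xbd'
multiple   = ("\uFFFD" * 3).encode("utf-8")  # (3) three U+FFFD -> 9 bytes

# No two of the three forms are byte-for-byte equal,
# so binary matching cannot treat them as the same text:
assert len({ill_formed, single, multiple}) == 3
```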

The best that can be done is to convert (1) into one of the other forms and then compare treating any run of FFFD code points as equal to any other run, irrespective of length. (For security-critical applications, the presence of any FFFD should render the data invalid, so the comparisons we'd be talking about here would be for general purpose, like search).
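The run-insensitive comparison described above can be sketched as follows (the helper name is mine; this is for general-purpose matching only, not for security-critical code, where any U+FFFD should invalidate the data):

```python
import re

# Any run of one or more replacement characters.
_FFFD_RUN = re.compile("\uFFFD+")

def fffd_equal(a: str, b: str) -> bool:
    # Collapse every run of U+FFFD to a single U+FFFD, then compare;
    # the length of each run no longer matters.
    return _FFFD_RUN.sub("\uFFFD", a) == _FFFD_RUN.sub("\uFFFD", b)
```

With this, fffd_equal("a\uFFFDb", "a\uFFFD\uFFFD\uFFFDb") holds even though plain string equality does not.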

Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late.

A./



ICU uses UTF-16 as its in-memory Unicode representation, so ICU is not
representative of the concerns of implementations that use UTF-8 as
their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes the concerns of such implementations very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake
You may think that.  There are those of us who do not.  The fact is that UTF-16 
makes sense as a default encoding in many cases.  Yes, UTF-8 is more efficient 
for primarily ASCII text, but that is not the case in other situations, and 
handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus 
on) is no more complicated than handling combining characters, which you have 
to do anyway.

Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.
To be sure, I don’t think the Unicode Consortium should be unsympathetic to 
people who use UTF-8 internally, but I don’t see what that has to do with 
either the original proposal or with your criticism of UTF-16.

[snip]

If the proposed
change were adopted, while Draconian decoders (that fail upon first
error) could retain their current state machine, implementations that
emit U+FFFD for errors and continue would have to add more state
machine states (i.e. more complexity) to consolidate more input bytes
into a single U+FFFD even after a valid sequence is obviously
impossible.
“Impossible”?  Why?  You just need to add some error states (or *an* error 
state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the 
only library that already did just that *because it’s clearly the right thing 
to do*.
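The "error state and a counter" approach can be sketched with a minimal decoder that implements the single-FFFD-per-sequence policy: when a sequence turns out to be ill-formed, it consumes any continuation-range bytes (0x80-0xBF) up to the length the lead byte promised and emits one U+FFFD for the whole run.  This is a simplification and the function name is mine; a production decoder would also constrain the second byte per lead (e.g. E0 requires A0-BF) before consuming further bytes.

```python
REPLACEMENT = "\uFFFD"

def decode_single_fffd(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # ASCII byte
            out.append(chr(b)); i += 1; continue
        if 0xC2 <= b <= 0xDF:   need = 1    # 2-byte lead
        elif 0xE0 <= b <= 0xEF: need = 2    # 3-byte lead
        elif 0xF0 <= b <= 0xF4: need = 3    # 4-byte lead
        else:                               # stray continuation / invalid lead
            out.append(REPLACEMENT); i += 1; continue
        cp = b & (0x3F >> need)             # payload bits of the lead byte
        j = i + 1
        # Consume continuation-range bytes up to the promised length
        # (this is the "error state plus counter" doing its work).
        while j <= i + need and j < len(data) and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        ok = (j == i + need + 1)
        # Reject overlong forms, surrogates, and out-of-range code points.
        mins = (0x80, 0x800, 0x10000)
        if ok and (cp < mins[need - 1] or 0xD800 <= cp <= 0xDFFF
                   or cp > 0x10FFFF):
            ok = False
        out.append(chr(cp) if ok else REPLACEMENT)  # one FFFD per bad run
        i = j
    return "".join(out)
```

Under this sketch the overlong sequence E0 80 80 decodes to a single U+FFFD, whereas a maximal-subpart decoder would emit three.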

Kind regards,

Alastair.

--
http://alastairs-place.net



