On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
I think Unicode should not adopt the proposed change.
Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting
multiple errors there makes no sense.
Changing a specification as fundamental as this is something that should
not be undertaken lightly.
Apparently we have a situation where implementations disagree, and have
done so for a while. This normally means not only that the
implementations differ, but that data exists in both formats.
Even if it were true that all data is only stored in UTF-8, any data
converted from UTF-8 back to UTF-8, going through an interim stage that
requires UTF-8 conversion, would then be different based on which
converter is used.
Implementations working in UTF-8 natively would potentially see three
formats:
1) the original ill-formed data
2) data converted with single FFFD
3) data converted with multiple FFFD
These forms cannot be compared for equality by binary matching.
The best that can be done is to convert (1) into one of the other forms
and then compare treating any run of FFFD code points as equal to any
other run, irrespective of length.
(For security-critical applications, the presence of any FFFD should
render the data invalid, so the comparisons we'd be talking about here
would be for general purpose, like search).
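
As a rough illustration of the comparison described above, here is a minimal Python sketch. The helper names (collapse_replacements, loosely_equal) are hypothetical and not taken from any library; the sketch assumes form (1) is first decoded with a replacing decoder, after which every run of U+FFFD is collapsed to a single one before comparing.

```python
import re

REPLACEMENT = "\ufffd"
# One or more consecutive U+FFFD code points.
_RUN = re.compile(REPLACEMENT + "+")


def collapse_replacements(text: str) -> str:
    """Collapse every run of U+FFFD into a single U+FFFD."""
    return _RUN.sub(REPLACEMENT, text)


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings, treating any run of U+FFFD as equal to any other."""
    return collapse_replacements(a) == collapse_replacements(b)


# Form (1): original ill-formed bytes (0xC0 0xAF is an over-long sequence).
raw = b"ab\xc0\xafcd"
# Forms (2) and (3): the same data after conversion by different converters.
single = "ab\ufffdcd"
multiple = "ab\ufffd\ufffdcd"

# Bring form (1) into one of the other forms with a replacing decoder.
decoded = raw.decode("utf-8", errors="replace")

print(loosely_equal(decoded, single), loosely_equal(single, multiple))
```

This is only suitable for general-purpose matching such as search; as noted above, a security-critical application would reject the data outright as soon as any U+FFFD is present.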
Because we've had years of multiple implementations, it would be
expected that copious data exists in all three formats, and that data
will not go away. Changing the specification to pick one of these
formats as solely conformant is IMHO too late.
A./
ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
representative of implementation concerns of implementations that use
UTF-8 as their in-memory Unicode representation.
Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes the concerns of such implementations very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake
You may think that. There are those of us who do not. The fact is that UTF-16
makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient
for primarily ASCII text, but that is not the case for other situations and the
fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4
usually focus on) is no more complicated than handling combining characters,
which you have to do anyway.
Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.
I don’t think the Unicode Consortium should be unsympathetic to people who use
UTF-8 internally, for sure, but I don’t see what that has to do with either the
original proposal or with your criticism of UTF-16.
[snip]
If the proposed change was adopted, while Draconian decoders (that fail
upon first error) could retain their current state machine,
implementations that emit U+FFFD for errors and continue would have to
add more state machine states (i.e. more complexity) to consolidate more
input bytes into a single U+FFFD even after a valid sequence is obviously
impossible.
“Impossible”? Why? You just need to add some error states (or *an* error
state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the
only library that already did just that *because it’s clearly the right thing
to do*.
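
To make the "error state and a counter" idea concrete, here is a minimal Python sketch. It is not ICU's algorithm, nor the exact wording of the proposal; it assumes one particular policy: the lead byte announces how many continuation bytes to expect, up to that many continuation bytes are consumed, and the whole group yields a single U+FFFD if it turns out to be ill-formed. The function name is hypothetical.

```python
def decode_one_fffd_per_sequence(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per ill-formed sequence (assumed policy)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        i += 1
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            continue
        # The lead byte sets the counter ("need") and the smallest code
        # point that legitimately requires a sequence of this length.
        if 0xC0 <= b <= 0xDF:
            need, cp, min_cp = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, min_cp = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, min_cp = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte (0x80..0xBF) or 0xF8..0xFF.
            out.append("\ufffd")
            continue
        # Consume up to "need" continuation bytes, valid combination or not.
        seen = 0
        while seen < need and i < n and 0x80 <= data[i] <= 0xBF:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            seen += 1
        well_formed = (
            seen == need                      # not truncated
            and cp >= min_cp                  # not over-long
            and cp <= 0x10FFFF                # within the Unicode range
            and not 0xD800 <= cp <= 0xDFFF    # not a surrogate
        )
        out.append(chr(cp) if well_formed else "\ufffd")
    return "".join(out)


overlong = b"a\xe0\x80\xafb"   # over-long three-byte sequence between 'a' and 'b'
print(ascii(decode_one_fffd_per_sequence(overlong)))
```

Under this policy the three bytes 0xE0 0x80 0xAF collapse into one U+FFFD, whereas a decoder that stops at the first byte that cannot continue a valid sequence would report them separately. The extra bookkeeping amounts to the `seen` counter and the final well-formedness check.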
Kind regards,
Alastair.
--
http://alastairs-place.net