On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote:
> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
> Even though there are notable systems (Win32, Java, C#, JavaScript,
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes concerns of such implementation very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake
You may think that. There are those of us who do not.
> My point is:
> The proposal seems to arise from the "UTF-16 as the in-memory
> representation" mindset. While I don't expect that case in any way to
> go away, I think the Unicode Consortium should recognize the serious
> technical merit of the "UTF-8 as the in-memory representation" case as
> having significant enough merit that proposals like this should
> consider impact to both cases equally despite "UTF-8 as the in-memory
> representation" case at present appearing to be the minority case.
> That is, I think it's wrong to view things only or even primarily
> through the lens of the "UTF-16 as the in-memory representation" case
> that ICU represents.
UTF-16 has some nice properties, and there's no need to brand it a
"mistake". UTF-8 has different nice properties, but there's equally no
reason to treat it as more special than UTF-16.
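As an aside (my illustration, not part of the original exchange): neither encoding form dominates the other even on the simple measure of size. A minimal Python sketch of the trade-off, using hypothetical sample strings:

```python
# Each encoding form wins for different text: UTF-8 for ASCII-heavy
# content, UTF-16 for BMP scripts such as CJK; both need 4 bytes for
# characters outside the BMP (sample strings chosen for illustration).
samples = {
    "ASCII": "hello",
    "CJK": "\u65e5\u672c\u8a9e",    # 日本語, three BMP characters
    "Emoji": "\U0001F600",          # 😀, outside the BMP
}

for label, text in samples.items():
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # LE without BOM, for a fair byte count
    print(f"{label}: UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes")
```

Running this shows 5 vs. 10 bytes for the ASCII sample, 9 vs. 6 for the CJK sample, and 4 vs. 4 for the emoji, which is the point: "nice properties" cut both ways.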
The UTC should adopt a position of perfect neutrality when it comes to
the in-memory representation; in other words, it should not assume that
optimizing for any one encoding form will benefit implementers.
The UTC, where ICU is strongly represented, needs to guard against basing
encoding/properties/algorithm decisions (edge cases, mostly) solely or
primarily on the needs of the particular implementation choices that
happen to have been made by the ICU project.
A./