On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote:

> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of the concerns of implementations that use UTF-8 as
> their in-memory Unicode representation.

> Even though there are notable systems (Win32, Java, C#, JavaScript,
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes the concerns of such implementations very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.

> My point is:
> The proposal seems to arise from the "UTF-16 as the in-memory
> representation" mindset. While I don't expect that case in any way to
> go away, I think the Unicode Consortium should recognize the serious
> technical merit of the "UTF-8 as the in-memory representation" case as
> having significant enough merit that proposals like this should
> consider impact to both cases equally despite the "UTF-8 as the
> in-memory representation" case at present appearing to be the minority
> case. That is, I think it's wrong to view things only or even
> primarily through the lens of the "UTF-16 as the in-memory
> representation" case that ICU represents.

UTF-16 has some nice properties, and there's no need to brand it a "mistake". UTF-8 has different nice properties, but there's equally no reason to treat it as more special than UTF-16.
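(Not part of the original email: a minimal sketch of the trade-off being discussed. UTF-8 stores ASCII in one byte per character, while UTF-16 uses a fixed two-byte code unit for every BMP character but needs surrogate pairs beyond the BMP. The sample strings and the helper name `utf16_code_units` are illustrative choices, not anything from the thread.)

```python
def utf16_code_units(s: str) -> int:
    """Number of UTF-16 code units -- what Java, C#, and JavaScript
    report as a string's length. Characters outside the BMP take two
    code units (a surrogate pair)."""
    return len(s.encode("utf-16-le")) // 2

for s in ["hello", "héllo", "日本語", "🙂"]:
    # Compare storage cost per encoding form: UTF-8 bytes vs. UTF-16
    # code units vs. code points.
    print(f"{s!r}: utf8={len(s.encode('utf-8'))} bytes, "
          f"utf16={utf16_code_units(s)} code units, "
          f"codepoints={len(s)}")
```

For ASCII-heavy text UTF-8 is half the size of UTF-16, while for CJK text UTF-16 is the more compact form (2 bytes vs. 3 per BMP character); neither form dominates, which is the point being made above.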

The UTC should adopt a position of perfect neutrality when it comes to in-memory representation; in other words, it should not assume that optimizing for any particular encoding form will benefit implementers.

The UTC, where ICU is strongly represented, needs to guard against basing encoding/properties/algorithm decisions (mostly edge cases) solely or primarily on the needs of the particular implementation strategy that the ICU project happens to have chosen.

A./
