On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote:
> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
> Even though there are notable systems (Win32, Java, C#, JavaScript,
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes concerns of such implementation very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake
You may think that. There are those of us who do not.
> My point is:
> The proposal seems to arise from the "UTF-16 as the in-memory
> representation" mindset. While I don't expect that case in any way to
> go away, I think the Unicode Consortium should recognize the serious
> technical merit of the "UTF-8 as the in-memory representation" case as
> having significant enough merit that proposals like this should
> consider impact to both cases equally despite "UTF-8 as the in-memory
> representation" case at present appearing to be the minority case.
> That is, I think it's wrong to view things only or even primarily
> through the lens of the "UTF-16 as the in-memory representation" case
> that ICU represents.
UTF-16 has some nice properties, and there's no need to brand it a
"mistake". UTF-8 has different nice properties, but there's equally no
reason to treat it as more special than UTF-16.
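As an aside (my illustration, not part of the original exchange): neither encoding form dominates the other even on the simple measure of size. A minimal Python sketch of the trade-off, using hypothetical sample strings:

```python
# Each encoding form wins for different text: UTF-8 for ASCII-heavy
# content, UTF-16 for BMP scripts such as CJK; both need 4 bytes for
# characters outside the BMP (sample strings chosen for illustration).
samples = {
    "ASCII": "hello",
    "CJK": "\u65e5\u672c\u8a9e",    # 日本語, three BMP characters
    "Emoji": "\U0001F600",          # 😀, outside the BMP
}

for label, text in samples.items():
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # LE without BOM, for a fair byte count
    print(f"{label}: UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes")
```

Running this shows 5 vs. 10 bytes for the ASCII sample, 9 vs. 6 for the CJK sample, and 4 vs. 4 for the emoji, which is the point: "nice properties" cut both ways.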
The UTC should adopt a position of perfect neutrality when it comes to
the in-memory representation; in other words, it should not assume that
optimizing for any one encoding form will benefit implementers.
The UTC, where ICU is strongly represented, needs to guard against basing
encoding/properties/algorithm decisions (edge cases, mostly) solely or
primarily on the needs of the particular implementation choices that
happen to have been made by the ICU project.
A./