On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
However, this meant that some precomposed characters were
"redundant": they represented character + diacritic combinations
that could equally well be expressed separately. Normalization
was the inevitable consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences
- the single one has to be used instead.
There is precedent. Some characters can be encoded with more
than one UTF-8 sequence, and the longer sequences were declared
invalid. Simple.
I.e. have the normalization up front when the text is created
rather than everywhere else.
I don't think it would work (or at least, the analogy doesn't
hold). It would mean that you can't add new precomposed
characters, because doing so would make previously valid
sequences invalid.
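
To make the difference concrete, here is a rough sketch in Python (not D, just because the standard library has what is needed; the helper `decodes_as_utf8` and the variable names are made up for illustration). The set of overlong UTF-8 sequences is fixed by the encoding arithmetic, so banning them never invalidates anything later. The set of "two-codepoint sequences that have a precomposed equivalent" depends on which precomposed characters exist: "e" + U+0301 has one, while "q" + U+0307 does not and so has to stay a valid sequence.

```python
import unicodedata


def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The UTF-8 precedent: 0xC0 0xAF is an overlong (two-byte) form of "/"
# (U+002F), which already has the one-byte encoding 0x2F.
overlong_slash = b"\xc0\xaf"
shortest_slash = b"\x2f"
print(decodes_as_utf8(overlong_slash))   # overlong form is rejected
print(decodes_as_utf8(shortest_slash))   # shortest form is accepted

# Canonical equivalence: "e" + COMBINING ACUTE ACCENT and the precomposed
# U+00E9 are different codepoint sequences for the same character.
decomposed_e = "e\u0301"
precomposed_e = "\u00e9"
nfc_e = unicodedata.normalize("NFC", decomposed_e)
print(decomposed_e == precomposed_e)     # raw comparison differs
print(nfc_e == precomposed_e)            # equal after NFC

# "q" + COMBINING DOT ABOVE has no precomposed character, so NFC leaves it
# as two codepoints. Under a "precomposed only" rule this sequence must be
# valid today, and would become invalid if a precomposed form were added.
no_precomposed = "q\u0307"
nfc_q = unicodedata.normalize("NFC", no_precomposed)
print(len(nfc_q))
```

So the ban on overlong forms could be stated once and for all, whereas a ban on decomposed forms would have to change every time a precomposed character is added.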