On 03/06/2016 20:12, Dmitry Olshansky wrote:
On 02-Jun-2016 23:27, Walter Bright wrote:

I wonder what rationale there is for Unicode to have two different
sequences of codepoints be treated as the same. It's madness.

Yeah, Unicode was not meant to be easy it seems. Or this is whatever
happens with evolutionary design that started with "everything is a
16-bit character".


Typing as someone who has spent some time creating typefaces: having two representations makes sense, and it didn't start with Unicode; it started with movable type.

It is much easier for a font designer to create the two-codepoint versions of characters in most instances, i.e. make the base letters and the diacritics once. Then what I often do is make single-codepoint versions of the ones I'm likely to use, but only if they need more tweaking than the kerning options of the font format allow. I'll omit the history lesson on how this was similar in the case of movable type.

Keyboards for different languages mean that a character that is a single keystroke in one case is two keystrokes, together or in sequence, in another. This means that Unicode not only represents completed strings, but also those that are mid-composition. The ordering that it uses to ensure that graphemes have a single canonical representation is based on the order in which those multi-key characters are entered. I wouldn't call it elegant, but it's not inelegant either.
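To make the two-representation point concrete, here is a small sketch using Python's standard `unicodedata` module (an assumption on my part; the thread is about D, but the behavior is the same in any conformant Unicode library). NFC composes a base letter plus combining mark into the precomposed codepoint where one exists, NFD decomposes it, and canonical reordering gives combining-mark sequences a single canonical order:

```python
import unicodedata

precomposed = "\u00e9"        # é as a single codepoint
combining   = "e\u0301"       # e followed by COMBINING ACUTE ACCENT

# The two sequences differ byte-for-byte but normalize to the same string.
print(precomposed == combining)                                    # False
print(unicodedata.normalize("NFC", combining) == precomposed)      # True
print(unicodedata.normalize("NFD", precomposed) == combining)      # True

# Canonical ordering: dot-below (combining class 220) and dot-above
# (class 230) can be typed in either order; normalization reorders
# them into one canonical sequence.
a = "q\u0307\u0323"   # q + dot above + dot below
b = "q\u0323\u0307"   # q + dot below + dot above
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```

This is why comparing Unicode strings codepoint-by-codepoint without normalizing first gives surprising results.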

Trying to represent all sufficiently similar glyphs with the same codepoint would lead to a layout problem. How would you order them so that strings of any language can be sorted by their local sorting rules, without having to special-case the algorithms?

Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", "ffl" and many, many more. Typographers create these glyphs whenever the available kerning tools do a poor job of combining them from the individual glyphs. From the point of view of meaning, they should still be represented as individual codepoints, but for display (electronic or print) that sequence needs to be replaced with the single codepoint for the ligature.
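Unicode actually encodes some of these ligatures as compatibility characters, and the "same meaning, different display form" relationship is captured by compatibility normalization (NFKC/NFKD). A minimal sketch, again assuming Python's `unicodedata`:

```python
import unicodedata

ligature = "\ufb03"   # U+FB03 LATIN SMALL LIGATURE FFI, a single codepoint

# As stored, the ligature is not equal to the three-letter sequence...
print(ligature == "ffi")                                  # False

# ...but compatibility normalization maps it back to the individual
# codepoints that carry the meaning.
print(unicodedata.normalize("NFKC", ligature) == "ffi")   # True
```

Note that NFC deliberately leaves the ligature alone; only the compatibility forms fold display variants into their plain-letter equivalents, which is why searching and sorting pipelines often apply NFKC while text that must round-trip exactly does not.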

I think that in order to understand the decisions of the Unicode committee, one has to consider that they are trying to unify the concerns of representing written information from two sides. One side prioritises storage and manipulation, while the other considers aesthetics and design workflow more important. My experience of using Unicode from both sides gives me a different appreciation for the difficulties of reconciling the two.

A...

P.S.

Then they started adding emojis, and I lost all faith in humanity ;)
