On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Saying that operating at the code point level - UTF-32 - is correct
> > is like saying that operating at UTF-16 instead of UTF-8 is correct.
>
> Could you please substantiate that? My understanding is that code unit
> is a higher-level Unicode notion independent of encoding, whereas code
> point is an encoding-dependent representation detail. -- Andrei
Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point).

assert("A"c.length == 1);
assert("A"w.length == 1);
assert("A"d.length == 1);

If you have 月, then you get

assert("月"c.length == 3);
assert("月"w.length == 1);
assert("月"d.length == 1);

whereas if you have 𐀆, then you get

assert("𐀆"c.length == 4);
assert("𐀆"w.length == 2);
assert("𐀆"d.length == 1);

So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for holding an entire character, but it still looks like UTF-32 does. However, what about characters like é or שׂ? Notice that שׂ takes up more than one code point.

assert("שׂ"c.length == 4);
assert("שׂ"w.length == 2);
assert("שׂ"d.length == 2);

It's ש with a combining dot marker used in Hebrew, but it's a single character in spite of the fact that it's multiple code points. é is in a similar, though more complicated, boat. With D, you'll get

assert("é"c.length == 2);
assert("é"w.length == 1);
assert("é"d.length == 1);

because the compiler uses the precomposed version of é, which is a single code point. However, Unicode is set up so that the accent can be its own code point and be applied to any other code point - be it an e, an a, or even something like the number 0. If we normalize é (with normalize from std.uni), we can see other versions of it that take up more than one code point. e.g.

assert("é"d.normalize!NFC.length == 1);
assert("é"d.normalize!NFD.length == 2);
assert("é"d.normalize!NFKC.length == 1);
assert("é"d.normalize!NFKD.length == 2);

And you can even put that accent on 0 by doing something like

assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);

One or more code units combine to make a single code point, and one or more code points combine to make a grapheme. So, while there is a definite layer of separation between code units and code points, a single code point is still not guaranteed to be a single character. It's true that code points, unlike code units, are independent of the encoding (though the different normalization forms are kind of like having different encodings), but in terms of correctness, treating code points as characters has the same problem as treating code units as characters. You're still not guaranteed that you're operating on full characters, and you risk chopping them up. It's just that at the code point level, you're generally chopping off something that is visually separable (like an accent from a letter or a superscript on a symbol), whereas with code units, you end up with utter garbage when you chop them incorrectly. By operating at the code point level, we're correct for _way_ more characters than we would be if we treated char as a full character, but we're still not fully correct, and it's a lot harder to notice when you screw it up, because the number of characters which are handled incorrectly is far smaller.

- Jonathan M Davis
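P.S. For anyone who wants to poke at this further, here's a minimal, self-contained sketch along the same lines using Phobos' normalize and byGrapheme from std.uni and walkLength from std.range. It assumes the é literal in the source is the precomposed, single-code-point form (as in the asserts above); treat it as illustrative rather than a recipe.

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFD;

void main()
{
    // Assumes "é" in the source is the precomposed form U+00E9
    // (a single code point), as in the asserts above.
    dstring precomposed = "é"d;
    assert(precomposed.length == 1);                 // one code point
    assert(precomposed.byGrapheme.walkLength == 1);  // one grapheme

    // The decomposed form is 'e' followed by the combining acute
    // accent U+0301: two code points, but still one grapheme.
    auto decomposed = precomposed.normalize!NFD;
    assert(decomposed.length == 2);
    assert(decomposed.byGrapheme.walkLength == 1);

    // Slicing at the code point level silently drops the accent.
    assert(decomposed[0 .. 1] == "e"d);
}

byGrapheme is what you'd reach for when you actually need to operate on full characters rather than code points.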