Re: The Case Against Autodecode

Andrei Alexandrescu via Digitalmars-d Tue, 31 May 2016 11:36:14 -0700

On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:

On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:

On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:

Saying that operating at the code point level - UTF-32 - is correct
is like saying that operating at UTF-16 instead of UTF-8 is correct.


Could you please substantiate that? My understanding is that code point
is a higher-level Unicode notion independent of encoding, whereas code
unit is an encoding-dependent representation detail. -- Andrei


Okay. If you have the letter A, it will fit in one UTF-8 code unit, one
UTF-16 code unit, and one UTF-32 code unit (so, one code point).

assert("A"c.length == 1);
assert("A"w.length == 1);
assert("A"d.length == 1);

If you have 月, then you get

assert("月"c.length == 3);
assert("月"w.length == 1);
assert("月"d.length == 1);

whereas if you have 𐀆, then you get

assert("𐀆"c.length == 4);
assert("𐀆"w.length == 2);
assert("𐀆"d.length == 1);

So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for
holding an entire character, but it still looks like UTF-32 does.


Does walkLength yield the same number for all representations?
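For reference, a minimal sketch of what walkLength reports for the characters above. It assumes Phobos' auto-decoding of narrow strings; `codePoints` is a hypothetical helper name, and the escapes stand in for the literals so the result does not depend on how the source file is encoded.

```d
import std.range : walkLength;

// Hypothetical helper: number of code points in a string of any UTF
// encoding. The range primitives auto-decode string and wstring, so
// walkLength steps over dchars rather than code units.
size_t codePoints(S)(S s)
{
    return s.walkLength;
}

void main()
{
    // U+6708 (月): .length counts code units and differs per encoding.
    assert("\u6708"c.length == 3);
    assert("\u6708"w.length == 1);
    assert("\u6708"d.length == 1);
    // walkLength gives the same count for all three encodings.
    assert(codePoints("\u6708"c) == 1);
    assert(codePoints("\u6708"w) == 1);
    assert(codePoints("\u6708"d) == 1);

    // U+10006 (𐀆): 4 UTF-8 code units, 2 UTF-16 code units, 1 UTF-32 code unit.
    assert("\U00010006"c.length == 4);
    assert("\U00010006"w.length == 2);
    assert("\U00010006"d.length == 1);
    assert(codePoints("\U00010006"c) == 1);
    assert(codePoints("\U00010006"w) == 1);
    assert(codePoints("\U00010006"d) == 1);
}
```

So for these characters the answer is yes: .length varies with the encoding, walkLength does not.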

However, what about characters like é or שׂ? Notice that שׂ takes up more than
one code point.

assert("שׂ"c.length == 4);
assert("שׂ"w.length == 2);
assert("שׂ"d.length == 2);

It's ש with some sort of dot marker on it that they have in Hebrew, but it's
a single character in spite of the fact that it's multiple code points. é is
in a similar, though more complicated boat. With D, you'll get

assert("é"c.length == 2);
assert("é"w.length == 1);
assert("é"d.length == 1);

because the compiler decides to use the version of é that's a single code
point.


Does walkLength yield the same number for all representations?

However, Unicode is set up so that that accent can be its own code
point and be applied to any other code point - be it an e, an a, or even
something like the number 0. If we normalize é, we can see other
versions of it that take up more than one code point. e.g.

assert("é"d.normalize!NFC.length == 1);
assert("é"d.normalize!NFD.length == 2);
assert("é"d.normalize!NFKC.length == 1);
assert("é"d.normalize!NFKD.length == 2);


Does walkLength yield the same number for all representations?
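A short sketch of how walkLength behaves across normalization forms, assuming std.uni's normalize and the usual auto-decoding. `countAs` is a hypothetical helper, not something from Phobos.

```d
import std.range : walkLength;
import std.uni : normalize, NormalizationForm, NFC, NFD;

// Hypothetical helper: code point count of s after normalizing it
// to the given form.
size_t countAs(NormalizationForm form, S)(S s)
{
    return normalize!form(s).walkLength;
}

void main()
{
    // U+00E9 is the precomposed é; NFD splits it into e + U+0301.
    assert("\u00E9".normalize!NFD == "e\u0301");

    // The count is the same for every encoding of a given form...
    assert(countAs!NFC("\u00E9"c) == 1);
    assert(countAs!NFC("\u00E9"w) == 1);
    assert(countAs!NFC("\u00E9"d) == 1);
    assert(countAs!NFD("\u00E9"c) == 2);
    assert(countAs!NFD("\u00E9"w) == 2);
    assert(countAs!NFD("\u00E9"d) == 2);

    // ...but it differs between forms, since walkLength counts code
    // points and the two forms contain different code points.
    assert(countAs!NFC("e\u0301") == 1);
}
```

In other words, walkLength is stable across UTF-8, UTF-16 and UTF-32, but not across normalization forms of the same character.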

And you can even put that accent on 0 by doing something like

assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);

One or more code units combine to make a single code point, but one or more
code points also combine to make a grapheme.

That's right. D's handling of UTF is at the code point level (like all of Unicode is portably defined). If you want graphemes, use byGrapheme.
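A minimal sketch of the byGrapheme route, assuming std.uni's default grapheme cluster rules. `graphemes` is a hypothetical helper name.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of grapheme clusters in s.
size_t graphemes(S)(S s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // ש (U+05E9) followed by the combining dot U+05C2:
    // two code points, one grapheme.
    assert("\u05E9\u05C2".walkLength == 2);
    assert(graphemes("\u05E9\u05C2") == 1);

    // Decomposed é (e + U+0301): two code points, one grapheme.
    assert("e\u0301".walkLength == 2);
    assert(graphemes("e\u0301") == 1);

    // The same accent on the digit 0.
    assert(graphemes("0\u0301") == 1);

    // The grapheme count does not depend on the encoding either.
    assert(graphemes("e\u0301"w) == 1);
    assert(graphemes("e\u0301"d) == 1);
}
```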


It seems you destroyed your own argument, which was:

Saying that operating at the code point level - UTF-32 - is correct
is like saying that operating at UTF-16 instead of UTF-8 is correct.


You can't claim code units are just a special case of code points.


Andrei
