On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

That's forgetting that most of the time people care about graphemes
(user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis
wrote a library that according to him does grapheme-related stuff nobody
else does. So apparently graphemes is not what people care about
(although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The issue *does* exist, as shown even by trivial examples such as Michel's below, which are not corner cases. The actual question is _not_ whether code or "grapheme" is the proper level of abstraction. To this, the answer is clear: codes are simply meaningless in 99% of cases. (All historic software deals with chars, conceptually, but they happen to be coded with single codes.)
(And what about Objective-C? Why did its designers even bother with that?).

The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors:

* The issue is masked by the misleading use of "abstract character" in unicode literature. "Abstract" is very correct, but they should have found a term other than "character", say "abstract scripting mark". Their deceptive terminological choice lets most programmers believe that code points encode characters, as in historic charsets. (Even worse: some docs explicitly state that ICU's notion of character matches the programming notion of character.)
* Unicode added precomposed codes for a bunch of characters, supposedly for backward compatibility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence is, at the pedagogical level, very bad: most text-producing software (like editors) uses such precomposed codes when available for a given character, so programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.)
* Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware that their code is incorrect: how do you even notice it in test output?

Thus, in practice, programmers can (1) simply be unaware of the issue, (2) have code that really works in the typical use cases of their software, or (3) not notice that their code runs incorrectly. There is also an intermediate situation between (2) & (3), similar to the old problems with ASCII-only apps: they work wrongly when used in a non-English environment, but what can users do, concretely? Most often, they just have to cope with the incorrectness, reinterpret the output, and/or find workarounds by cheating with the interface.
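To make case (3) concrete, here is a minimal sketch of a comparison that goes wrong silently. It assumes a Phobos version whose std.uni provides `normalize`; the helper name `sameText` is made up for the illustration.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC;

// Hypothetical helper: compares two strings after bringing both to the
// same normalization form (NFC), so that precomposed and decomposed
// spellings of the same character compare equal.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";  // 'é' as a single code point
    string decomposed  = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    // Both print as the same character, yet a plain comparison says
    // they differ, and nothing in typical test output reveals why.
    writeln(precomposed == decomposed);
    writeln(sameText(precomposed, decomposed));
}
```

Normalization only makes the two spellings agree code by code; it does not by itself give access to whole characters.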

The responsibility of designers of tools for programmers is, imo, important. We should first make the issue clear (very difficult, since it's a ubiquitous myth to break down), and then propose services that run correctly in situations where said issue is relevant, here the manipulation of universal text, even if they are not very efficient at the start. On my side, and about D, I wish that most D programmers (1) are aware of the problem, (2) understand its whys & hows, and (3) know there is a correct solution. Then, (4) whether they actually use it is their choice (and I don't care whether or not they do).

It also supports this:

foreach(i, d; s)
{
    writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to
specify dchar.
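For reference, a minimal sketch of what this looks like with a plain D string. The helper name `codePointStarts` is made up for the illustration.

```d
import std.stdio : writeln;

// Hypothetical helper: collects the index at which each code point
// starts in a UTF-8 string.
size_t[] codePointStarts(string s)
{
    size_t[] starts;
    // Declaring the element as dchar makes foreach decode UTF-8;
    // i is the index of the first code unit of each code point.
    foreach (i, dchar d; s)
        starts ~= i;
    return starts;
}

void main()
{
    // U+00E9 takes two UTF-8 code units, so the indices skip 2.
    writeln(codePointStarts("a\u00E9!")); // prints [0, 1, 3]
}
```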

Except it breaks with combining characters. For instance, take the
string "t̃", which is two code points -- 't' followed by combining tilde
(U+0303) -- and you'll get the following output:

The character in position 0 is t
The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space
character.)

The conception of character that normal people have does not match the
notion of code points when combining characters enter the equation.
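The mismatch can be measured directly. This is a minimal sketch, assuming a Phobos version whose std.uni provides `byGrapheme` (this is not the Text library discussed in this thread); the helper names are made up for the illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helpers: the three possible "lengths" of one string.
size_t codeUnitCount(string s)  { return s.length; }
size_t codePointCount(string s) { return s.walkLength; }
size_t graphemeCount(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    string s = "t\u0303"; // 't' followed by U+0303 COMBINING TILDE

    writeln(codeUnitCount(s));  // UTF-8 code units
    writeln(codePointCount(s)); // code points
    writeln(graphemeCount(s));  // user-perceived characters

    // Iterating by grapheme keeps the base letter and its tilde together.
    foreach (g; s.byGrapheme)
        writeln("grapheme made of ", g.length, " code point(s)");
}
```

For this string, the three counts are 3, 2, and 1 respectively.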

This might be a good time to see whether we need to address graphemes
systematically. Could you please post a few links that would educate me
and others in the mysteries of combining characters?

Beware! far too long text. https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction (the directory above contains the current rough implementation of Text, plus a bit of its brother package DUnicode)

Thanks,

Andrei

Denis
_________________
vita es estrany
spir.wikidot.com
