On Fri, 14 Jan 2011 08:14:02 -0500, spir <[email protected]> wrote:

On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

That's forgetting that most of the time people care about graphemes
(user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis
wrote a library that according to him does grapheme-related stuff nobody
else does. So apparently graphemes is not what people care about
(although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The issue *does* exist --as shown even by trivial examples such as Michel's below, not just corner cases. The actual question is _not_ whether code point or "grapheme" is the proper level of abstraction. To this, the answer is clear: code points are simply meaningless in 99% of cases. (All historic software deals with characters, conceptually; they just happened to be coded with single codes.) (And what about Objective-C? Why did its designers even bother with that?)
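To make the mismatch concrete, here is a small Python sketch (Python rather than D, just because its stdlib makes the counts easy to show). The character "ü" written in decomposed form is one user-perceived character, but neither the code-point count nor the code-unit count says so:

```python
# "ü" spelled as base letter + combining mark (a decomposed form).
s = "u\u0308"  # 'u' followed by COMBINING DIAERESIS (U+0308)

print(len(s))                  # 2  -- code points
print(len(s.encode("utf-8")))  # 3  -- UTF-8 code units (bytes)
# To a reader this is ONE character. Python's stdlib offers no grapheme
# iterator, so a naive len() silently reports code points instead.
```

So three different "lengths" coexist for the same visible character, and most APIs hand you the two that users do not care about.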

The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors:

* The issue is masked by the misleading use of "abstract character" in Unicode literature. "Abstract" is quite correct, but they should have found another term than "character", say "abstract scripting mark". This deceptive terminological choice lets most programmers believe that code points code characters, like in historic charsets. (Even worse: some documentation explicitly states that ICU's notion of character matches the programming notion of character.)

* Unicode added precomposed codes for a bunch of characters, supposedly for backward compatibility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence, at the pedagogical level, is very bad: most text-producing software (like editors) uses such precomposed codes when available for a given character, so programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.)

* Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms.

* Programmers can very easily be unaware their code is incorrect: how would you even notice it in test output?
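The precomposed-form masking described above is exactly what Unicode normalization toggles. A short Python illustration (using the stdlib `unicodedata` module; NFC produces the precomposed form, NFD the decomposed one):

```python
import unicodedata

decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: two code points
precomposed = unicodedata.normalize("NFC", decomposed)

print(precomposed == "\u00e9")  # True: NFC yields the single code point U+00E9
print(len(precomposed))         # 1
# And NFD takes the precomposed form back apart:
print(unicodedata.normalize("NFD", "\u00e9") == decomposed)  # True
```

Because editors tend to emit the NFC form, a program that equates code points with characters appears to work -- right up until it meets decomposed input, or a character with no precomposed form at all.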

* I don't even know how to make a grapheme that is more than one code-unit, let alone more than one code-point :) Every time I try, I get 'invalid utf sequence'.
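For what it's worth, the trick is to append a *combining* code point after a base character, rather than splicing raw code units (which is what typically produces "invalid utf sequence" errors). A minimal Python sketch:

```python
# A grapheme made of two code points: base letter plus a combining mark.
g = "a" + "\u0301"  # 'a' + COMBINING ACUTE ACCENT, rendered as "á"

print(len(g))             # 2 code points, yet one user-perceived character
print(g.encode("utf-8"))  # b'a\xcc\x81' -- a perfectly valid UTF-8 sequence
```

Each code point encodes to valid UTF-8 on its own, so the concatenation is valid too; the "grapheme-ness" only exists at the segmentation level above both code units and code points.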

I feel significantly ignorant on this issue, and I'm slowly getting enough knowledge to join the discussion, but being a dumb American who only speaks English, I have a hard time grasping how this shit all works.

-Steve
