On Fri, 14 Jan 2011 08:14:02 -0500, spir <[email protected]> wrote:
On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:
That's forgetting that most of the time people care about graphemes
(user-perceived characters), not code points.
I'm not so sure about that. What do you base this assessment on? Denis
wrote a library that according to him does grapheme-related stuff nobody
else does. So apparently graphemes are not what people care about
(although it might be what they should care about).
I'm aware of that, and I have no definitive answer to the question. The
issue *does* exist, as shown even by trivial examples such as Michel's
below, not just corner cases. The actual question is _not_ whether code
point or "grapheme" is the proper level of abstraction. To this, the answer
is clear: code points are simply meaningless in 99% of cases. (All historic
software deals with chars, conceptually, but those chars happen to be coded
with single codes.)
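To make the mismatch concrete, here is a minimal sketch in Python (chosen
only for brevity; `len` there counts code points):

```python
# One user-perceived character ("é"), two different code point sequences.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + U+0301 COMBINING ACUTE ACCENT

print(len(precomposed))  # 1 code point
print(len(decomposed))   # 2 code points, yet one grapheme on screen
print(precomposed == decomposed)  # False: comparison is by code points
```

Any code that counts, slices, or compares at the code-point level silently
gives a different answer for the two spellings of the same character.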
(And what about Objective-C? Why did its designers even bother with
that?).
The question is rather: why do we nearly all happily go on ignoring the
issue? My present guess is a combination of factors:
* The issue is masked by the misleading use of "abstract character" in
Unicode literature. "Abstract" is quite correct, but they should have
chosen a term other than "character", say "abstract scripting mark". This
deceptive terminological choice leads most programmers to believe that
code points encode characters, as in historic charsets.
(Even worse: some docs explicitly state that Unicode's notion of character
matches the programming notion of character.)
* Unicode added precomposed code points for a bunch of characters,
supposedly for backward compatibility with said charsets. (But where is
the gain? We need to decompose them anyway...) The consequence, at the
pedagogical level, is very bad: most text-producing software (such as
editors) uses a precomposed code point whenever one is available for a
given character, so programmers can happily go on believing in the
code=character myth.
(Note: the gain in space is negligible for western text.)
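The space claim is easy to check; a small sketch using Python's
`unicodedata` module (NFC is the precomposed normalization form, NFD the
decomposed one):

```python
import unicodedata

word = "caf\u00e9"  # "café", written with a precomposed é

nfc = unicodedata.normalize("NFC", word)  # precomposed form
nfd = unicodedata.normalize("NFD", word)  # decomposed form

print(len(nfc), len(nfd))  # 4 vs 5 code points
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 5 vs 6 bytes
```

One byte saved per accented letter in UTF-8: next to nothing for typical
western text, which is mostly unaccented ASCII anyway.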
* Most characters that appear in western texts (at least "official"
characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware that their code is incorrect: how
would you even notice it in test output?
* I don't even know how to make a grapheme that is more than one
code-unit, let alone more than one code-point :) Every time I try, I get
'invalid utf sequence'.
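For what it's worth, a multi-code-point (and multi-code-unit) grapheme can
be built by appending combining marks to a base letter; a Python sketch.
(A guess, not a diagnosis of the error above: an "invalid utf sequence"
usually comes from splicing raw bytes mid-sequence, not from using
combining marks.)

```python
# Base letter plus two combining marks: three code points, one grapheme.
g = "a\u0301\u0308"  # "a" + COMBINING ACUTE + COMBINING DIAERESIS

print(len(g))                           # 3 code points
print(len(g.encode("utf-8")))           # 5 UTF-8 code units (1 + 2 + 2 bytes)
print(len(g.encode("utf-16-le")) // 2)  # 3 UTF-16 code units
```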
I feel significantly ignorant on this issue, and I'm slowly getting enough
knowledge to join the discussion, but being a dumb American who only
speaks English, I have a hard time grasping how this shit all works.
-Steve