Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

spir Fri, 14 Jan 2011 05:15:18 -0800

On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

That's forgetting that most of the time people care about graphemes
(user-perceived characters), not code points.


I'm not so sure about that. What do you base this assessment on? Denis
wrote a library that according to him does grapheme-related stuff nobody
else does. So apparently graphemes is not what people care about
(although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The
issue *does* exist -- as shown even by trivial examples such as Michel's
below, not corner cases. The actual question is _not_ whether code or
"grapheme" is the proper level of abstraction. To this, the answer is
clear: codes are simply meaningless in 99% of cases. (All historic
software deals with chars, conceptually, but they happen to be coded
with single codes.)

(And what about Objective-C? Why did its designers even bother with that?).

The question is rather: why do we nearly all happily go on ignoring the
issue? My present guess is a combination of factors:

* The issue is masked by the misleading use of "abstract character" in
  Unicode literature. "Abstract" is very correct, but they should have
  found a term other than "character", say "abstract scripting mark".
  Their deceptive terminological choice lets most programmers believe
  that codepoints code characters, as in historic charsets.
  (Even worse: some docs explicitly state that ICU's notion of character
  matches the programming notion of character.)
* ICU added precomposed codes for a bunch of characters, supposedly for
  backward compatibility with said charsets. (But where is the gain? We
  need to decode them anyway...) The consequence is, at the pedagogical
  level, very bad: most text-producing software (like editors) uses such
  precomposed codes when available for a given character, so programmers
  can happily go on believing in the code=character myth.
  (Note: the gain in space is ridiculous for western text.)
* Most characters that appear in western texts (at least "official"
  characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware their code is incorrect: how do
  you even notice it in test output?

Thus, practically, programmers can (1) simply not know about the issue,
(2) have code that really works in typical use cases for their software,
or (3) not notice that their code runs incorrectly.
There is also an intermediate situation between (2) & (3), similar to
old problems with previous ASCII-only apps: they work wrongly when used
in a non-English environment, but what can users do, concretely? Most
often, they just have to cope with incorrectness, reinterpret outputs
differently, and/or find workarounds by cheating with the interface.
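
The precomposed-versus-decomposed point can be made concrete with a
small sketch. It assumes a recent Phobos in which std.uni provides
normalize and the NFC constant; the helper name sameText is hypothetical
and only serves the illustration. Two strings that render identically
compare unequal until both are brought to the same normalization form,
which is one way incorrect code goes unnoticed in test output.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFC, normalize;

// Hypothetical helper: compares two strings after bringing both to
// Normalization Form C, so canonically equivalent text compares equal.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";   // e-acute as a single code point
    string decomposed  = "e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT

    // Both display as the same character, yet differ at code level.
    writeln(precomposed == decomposed);           // false
    writeln(sameText(precomposed, decomposed));   // true

    // Code point counts differ as well: 1 versus 2.
    writeln(precomposed.walkLength, " ", decomposed.walkLength);
}
```

Normalization only removes the precomposed/decomposed ambiguity; it does
not by itself give character-level (grapheme) indexing.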

The responsibility of designers of tools for programmers is, imo,
important. We should make the issue clear, first (very difficult, it's
a ubiquitous myth to break down), and propose services that run
correctly in situations where said issue is relevant, here manipulation
of universal text, even if not very efficient at start.
On my side, and about D, I wish that most D programmers (1) are aware of
the problem (2) understand its whys & hows (3) know there is a correct
solution. Then, (4) actually using it is their choice (and I don't care
whether or not they do).

It also supports this:

foreach(i, d; s)
{
    writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)


Well string supports that too, albeit with the nit that you need to
specify dchar.


Except it breaks with combining characters. For instance, take the
string "t̃", which is two code points -- 't' followed by combining tilde
(U+0303) -- and you'll get the following output:

The character in position 0 is t
The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space
character.)

The conception of character that normal people have does not match the
notion of code points when combining characters enters the equation.
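
A minimal sketch of the mismatch described above, assuming a recent
Phobos in which std.uni provides byGrapheme. The helper names
codePointCount and graphemeCount are hypothetical, introduced only for
this illustration. The same two-code-point string is counted at three
levels: code units, code points, and graphemes.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper: number of code points (string ranges decode to dchar).
size_t codePointCount(string s)
{
    return s.walkLength;
}

// Hypothetical helper: number of user-perceived characters (graphemes).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 't' followed by U+0303 COMBINING TILDE
    string s = "t\u0303";

    writeln("code units:  ", s.length);          // 3 (UTF-8 bytes)
    writeln("code points: ", codePointCount(s)); // 2
    writeln("graphemes:   ", graphemeCount(s));  // 1
}
```

Only the last count matches what a reader would call "one character";
iterating by dchar, as in the foreach above, works at the middle level.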


This might be a good time to see whether we need to address graphemes
systematically. Could you please post a few links that would educate me
and others in the mysteries of combining characters?

Beware! Far too long a text.
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
(the directory above contains the current rough implementation of Text,
plus a bit of its brother package DUnicode)

Thanks,

Andrei


Denis
_________________
vita es estrany
spir.wikidot.com
