On Fri, 14 Jan 2011 08:14:02 -0500, spir <[email protected]> wrote:
On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:
That's forgetting that most of the time people care about graphemes
(user-perceived characters), not code points.
I'm not so sure about that. What do you base this assessment on? Denis
wrote a library that according to him does grapheme-related stuff nobody
else does. So apparently graphemes are not what people care about
(although it might be what they should care about).
I'm aware of that, and I have no definitive answer to the question. The
issue *does* exist, as shown even by trivial examples such as Michel's
below, not just corner cases. The actual question is _not_ whether code
point or "grapheme" is the proper level of abstraction. To this, the answer
is clear: code points are simply meaningless in 99% of cases. (All historic
software deals with chars, conceptually, but those chars happen to be coded
with single codes.)
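To make the mismatch concrete, here is a minimal sketch in Python (chosen
only for brevity; `len` there counts code points):

```python
# One user-perceived character ("é"), two different code point sequences.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + U+0301 COMBINING ACUTE ACCENT

print(len(precomposed))  # 1 code point
print(len(decomposed))   # 2 code points, yet one grapheme on screen
print(precomposed == decomposed)  # False: comparison is by code points
```

Any code that counts, slices, or compares at the code-point level silently
gives a different answer for the two spellings of the same character.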
(And what about Objective-C? Why did its designers even bother with
that?).
The question is rather: why do we nearly all happily go on ignoring the
issue? My present guess is a combination of factors:
* The issue is masked by the misleading use of "abstract character" in
Unicode literature. "Abstract" is quite correct, but they should have
chosen a term other than "character", say "abstract scripting mark". This
deceptive terminological choice leads most programmers to believe that
code points encode characters, as in historic charsets.
(Even worse: some docs explicitly state that Unicode's notion of character
matches the programming notion of character.)
* Unicode added precomposed code points for a bunch of characters,
supposedly for backward compatibility with said charsets. (But where is
the gain? We need to decompose them anyway...) The consequence, at the
pedagogical level, is very bad: most text-producing software (such as
editors) uses a precomposed code point whenever one is available for a
given character, so programmers can happily go on believing in the
code=character myth.
(Note: the gain in space is negligible for western text.)
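The space claim is easy to check; a small sketch using Python's
`unicodedata` module (NFC is the precomposed normalization form, NFD the
decomposed one):

```python
import unicodedata

word = "caf\u00e9"  # "café", written with a precomposed é

nfc = unicodedata.normalize("NFC", word)  # precomposed form
nfd = unicodedata.normalize("NFD", word)  # decomposed form

print(len(nfc), len(nfd))  # 4 vs 5 code points
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 5 vs 6 bytes
```

One byte saved per accented letter in UTF-8: next to nothing for typical
western text, which is mostly unaccented ASCII anyway.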
* Most characters that appear in western texts (at least "official"
characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware that their code is incorrect: how
would you even notice it in test output?
* I don't even know how to make a grapheme that is more than one
code-unit, let alone more than one code-point :) Every time I try, I get
'invalid utf sequence'.
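For what it's worth, a multi-code-point (and multi-code-unit) grapheme can
be built by appending combining marks to a base letter; a Python sketch.
(A guess, not a diagnosis of the error above: an "invalid utf sequence"
usually comes from splicing raw bytes mid-sequence, not from using
combining marks.)

```python
# Base letter plus two combining marks: three code points, one grapheme.
g = "a\u0301\u0308"  # "a" + COMBINING ACUTE + COMBINING DIAERESIS

print(len(g))                           # 3 code points
print(len(g.encode("utf-8")))           # 5 UTF-8 code units (1 + 2 + 2 bytes)
print(len(g.encode("utf-16-le")) // 2)  # 3 UTF-16 code units
```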
I feel significantly ignorant on this issue, and I'm slowly getting enough
knowledge to join the discussion, but being a dumb American who only
speaks English, I have a hard time grasping how this shit all works.
-Steve