On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
The main (and massively ignored) issue when manipulating unicode text is
rather that, unlike with legacy character sets, one codepoint does *not*
represent a character in the common sense. In character sets like
latin-1:
* each code represents a character, in the common sense (eg "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
All of this is wrong with unicode. And these are complicated and
high-level issues, that appear _after_ decoding, on codepoint sequences.

If VLERange is helpful in dealing with those problems, then I don't
understand your presentation, sorry. Do you mean, for instance, that such
a range would, under the hood, group together the codes belonging to the
same character (thus making indexing meaningful), and/or normalise them
(decompose & order), thus allowing correct comparison, search and count?

VLERange would offer automatic decoding in front, back, popFront, and
popBack - just like BidirectionalRange does right now. It would also
offer access to the representational support by means of indexing - also
like char[] et al already do now.

IIUC, for the case of text, VLERange helps abstract away the annoying fact that a codepoint is encoded as a variable number of code units.
What I meant are issues like:

    auto text = "a\u0302"d;
    writeln(text);                  // "â"
    auto range = VLERange(text);
    // extracts characters correctly?
    auto letter = range.front();    // "a" or "â"?
    // case yes: compares correctly?
    assert(range.front() == "â");   // fail or pass?

Both fail using all unicode-aware types I know of, because:
1. They do not recognise that a character may be represented by an arbitrary number of codes (code _points_).
2. They do not use normalised forms for comparison, search, count, etc.
(while in unicode a given character can have several representations).
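
To make the two points concrete, here is a minimal sketch in D. It assumes a Phobos version whose std.uni provides byGrapheme and normalize (these were not available when this thread was written); the helper names characterCount and sameText are made up for illustration and are not part of any proposed VLERange API.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: number of user-perceived characters
/// (grapheme clusters) in `s`, rather than code points.
size_t characterCount(dstring s)
{
    return s.byGrapheme.walkLength;
}

/// Hypothetical helper: compare two texts after bringing both
/// to the same normal form (NFC here).
bool sameText(dstring a, dstring b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    dstring decomposed  = "a\u0302"d; // 'a' + COMBINING CIRCUMFLEX ACCENT
    dstring precomposed = "\u00E2"d;  // the same character as one code point

    // Point 1: at code point level the character is split in two,
    // so the first element is only the base letter.
    assert(decomposed.length == 2);
    assert(decomposed[0] == 'a');
    assert(characterCount(decomposed) == 1);
    assert(characterCount(precomposed) == 1);

    // Point 2: without normalisation the two spellings compare unequal.
    assert(decomposed != precomposed);
    assert(sameText(decomposed, precomposed));
}
```

Neither step is something decoding alone gives you: both operate on sequences of code points that have already been decoded.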

The difference is that VLERange being
a formal concept, algorithms can specialize on it instead of (a)
specializing for UTF strings or (b) specializing for BidirectionalRange
and then manually detecting isSomeString inside. Conversely, when
defining an algorithm you can specify VLERange as a requirement.
Boyer-Moore is a perfect example - it doesn't work on bidirectional
ranges, but it does work on VLERange. I suspect there are many like it.

Of course, it would help a lot if we figured other remarkable VLERanges.
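
As a rough illustration of "specifying VLERange as a requirement", the sketch below uses a template constraint. The trait isVLERange and the function codeUnitCount are hypothetical names, not Phobos symbols, and the trait is only an approximation that happens to match Phobos narrow strings: it accepts a bidirectional range whose decoded element type differs from the type of its underlying code units.

```d
import std.range : walkLength;
import std.range.primitives : ElementEncodingType, ElementType,
                              isBidirectionalRange;
import std.traits : Unqual;

// Hypothetical trait (an assumption, not an existing Phobos symbol):
// a bidirectional range of decoded elements whose storage consists of
// code units of a different type.
enum isVLERange(R) = isBidirectionalRange!R
    && !is(Unqual!(ElementEncodingType!R) == Unqual!(ElementType!R));

// An algorithm that states the concept as its requirement and works
// directly on the representational support (the code units).
size_t codeUnitCount(R)(R r)
if (isVLERange!R)
{
    return r.length;
}

static assert( isVLERange!string);   // UTF-8: variable-length
static assert( isVLERange!wstring);  // UTF-16: variable-length
static assert(!isVLERange!dstring);  // UTF-32: one unit per code point
static assert(!isVLERange!(int[]));  // plain array, nothing to decode

void main()
{
    string s = "\u00E2";             // one code point, two UTF-8 code units
    assert(codeUnitCount(s) == 2);   // representation level
    assert(s.walkLength == 1);       // decoded level (front/popFront)
}
```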

I think I see the point, and the general usefulness of such an abstraction. But it would certainly be more useful in fields other than text manipulation, because text has far more annoying issues (which, as in the example above, simply prevent code correctness).

Denis
_________________
vita es estrany
spir.wikidot.com
