On 1/11/11 4:46 PM, spir wrote:
On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
The main (and massively ignored) issue when manipulating unicode text is
rather that, unlike with legacy character sets, one codepoint does *not*
represent a character in the common sense. In character sets like
latin-1:
* each code represents a character, in the common sense (e.g. "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
All of this is wrong with unicode. And these are complicated and
high-level issues, that appear _after_ decoding, on codepoint sequences.
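[A minimal illustration of that mismatch, sketched in Python here rather than D since it only needs the stdlib unicodedata module: the character "â" has both a precomposed and a decomposed form, so naive codepoint-level equality and length both give surprising answers until the text is normalised.]

```python
import unicodedata

decomposed = "a\u0302"  # 'a' followed by U+0302 COMBINING CIRCUMFLEX ACCENT
composed = "\u00e2"     # precomposed 'â', a single code point

# Both render as the same user-perceived character, yet
# codepoint-level views disagree:
assert decomposed != composed                      # naive equality fails
assert (len(decomposed), len(composed)) == (2, 1)  # "lengths" differ

# Only after normalisation (NFC here) do the two compare equal:
assert unicodedata.normalize("NFC", decomposed) == composed
```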

If VLERange is helpful in dealing with those problems, then I don't
understand your presentation, sorry. Do you for instance mean such a
range would, under the hood, group together codes belonging to the same
character (thus making indexing meaningful), and/or normalise (decompose
& reorder) (thus allowing comparing/finding/counting correctly)?

VLERange would offer automatic decoding in front, back, popFront, and
popBack - just like BidirectionalRange does right now. It would also
offer access to the representational support by means of indexing - also
like char[] et al already do now.
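[The two access modes described above, decoding at the front/back plus raw indexing into the representation, can be sketched in Python over UTF-8 bytes. VLERange is only a proposal at this point, so this is an analogy, not its actual API.]

```python
data = "âbc".encode("utf-8")  # UTF-8 code units: 0xC3 0xA2 0x62 0x63

# Indexing exposes the representational support (code units),
# just as indexing a char[] does in D today:
assert data[0] == 0xC3

# A decoding "front": the first code point spans a variable number
# of code units (two here) and decodes to 'â'.
front = data.decode("utf-8")[0]
assert front == "â"
```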

IIUC, for the case of text, VLERange helps abstract away the annoying
fact that a codepoint is encoded as a variable number of code units.
What I meant are issues like:

auto text = "a\u0302"d;
writeln(text); // "â"
auto range = VLERange(text);
// extracts characters correctly?
auto letter = range.front(); // "a" or "â"?
// case yes: compares correctly?
assert(range.front() == "â"); // fail or pass?

You should try text.front right now; you might be surprised :o).
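[For readers without a D compiler handy: the surprise is that per-codepoint iteration stops after the base letter. The analogous effect in Python:]

```python
text = "a\u0302"  # renders as "â", but is two code points
# A codepoint-level "front" yields the plain base letter, not "â":
assert text[0] == "a"
assert len(text) == 2
```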

Andrei
