On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
On 1/11/11 4:41 AM, Michel Fortin wrote:
On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
<[email protected]> said:
In addition to these (and connecting the two), a VLERange would offer
two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed
to skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_
step that goes to the previous element.
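To make the proposal concrete, here is a hedged sketch of what stepSize/backstepSize could compute over a UTF-8 buffer (Python rather than D, and the function names merely mirror the proposed primitives; nothing here is the actual VLERange API):

```python
# Hypothetical sketch of stepSize/backstepSize over UTF-8 bytes.
# Names follow the proposal above; this is not real D VLERange code.

def step_size(data: bytes, offset: int) -> int:
    """Length in code units (bytes) of the element starting at offset."""
    b = data[offset]
    if b < 0x80:
        return 1          # ASCII: single byte
    if b >= 0xF0:
        return 4          # 4-byte sequence
    if b >= 0xE0:
        return 3          # 3-byte sequence
    return 2              # 2-byte sequence

def backstep_size(data: bytes, offset: int) -> int:
    """Length of the element that ends just before offset."""
    n = 1
    while data[offset - n] & 0xC0 == 0x80:   # skip continuation bytes
        n += 1
    return n

s = "a\u0153\u00e9".encode("utf-8")   # 'a' = 1 byte, 'œ' and 'é' = 2 bytes each
assert step_size(s, 0) == 1
assert step_size(s, 1) == 2            # 'œ'
assert backstep_size(s, len(s)) == 2   # 'é'
```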

I like the idea, but I'm not sure about this interface. What's the
result of stepSize if your range must create two elements from one
underlying unit? Perhaps in those cases the element type could be an
array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string and
converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
"oe" (one chararacter to two characters), in this case 'front' could
simply return "oe" (two characters) in one iteration, with stepSize
being the size of the "œ" code point. In the same conversion process,
encountering "e" followed by a combining "´" would return pre-combined
character "é" (two characters to one character).
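Michel's conversion range can be sketched roughly as follows (a hedged Python illustration, not his actual design; the LOSSY mapping table is invented for the example):

```python
import unicodedata

# Sketch of a conversion range whose elements may be longer or shorter
# than one input code point: 'œ' expands to "oe", while 'e' + combining
# acute pre-combines into the single character 'é'.

LOSSY = {"\u0153": "oe", "\u0152": "OE"}   # one code point -> two characters

def latin1_elements(s):
    # NFC normalization pre-combines sequences like 'e' + U+0301 into 'é',
    # so one output element can come from two input code points.
    for ch in unicodedata.normalize("NFC", s):
        yield LOSSY.get(ch, ch)

assert list(latin1_elements("\u0153uf")) == ["oe", "u", "f"]     # "œuf"
assert list(latin1_elements("e\u0301")) == ["\u00e9"]            # 'e' + ´ -> 'é'
```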

In the design as I thought of it, the effective length of one logical
element is one or more representation units. My understanding is that
you are referring to a fractional number of representation units for one
logical element.

I think Michel is right. If I understand correctly, VLERange addresses the low-level and rather simple issue of each codepoint being encoded as a variable number of code units. Right? If so, then what is the advantage of VLERange? D already has string/wstring/dstring, allowing one to work with the most advantageous encoding for the given source data, with dstring abstracting away low-level encoding issues.

The main (and massively ignored) issue when manipulating Unicode text is rather that, unlike with legacy character sets, one codepoint does *not* represent a character in the common sense. In character sets like Latin-1:
* each code represents a character, in the common sense (e.g. "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
None of this holds with Unicode. And these are complicated, high-level issues that appear _after_ decoding, on codepoint sequences.
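All three broken assumptions can be demonstrated with plain Python (nothing D-specific here; U+0300 is the combining grave accent):

```python
import unicodedata

precomposed = "\u00e0"     # 'à' as one code point (U+00E0)
decomposed  = "a\u0300"    # 'a' + combining grave accent

# 1. One code point is not one character: two code points, one "à".
assert len(decomposed) == 2

# 2. Character representations do not all have the same size,
#    even counted in code points rather than bytes.
assert len(precomposed) != len(decomposed)

# 3. The same character has more than one representation:
#    the two spellings compare unequal until normalized.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```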

If VLERange is meant to help with those problems, then I don't understand your presentation, sorry. Do you for instance mean such a range would, under the hood, group together codes belonging to the same character (thus making indexing meaningful), and/or normalise (decompose & reorder) (thus allowing one to compare/find/count correctly)?
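The normalise-before-comparing step asked about above looks roughly like this (a Python sketch using NFD, i.e. decompose and reorder; the helper names are invented for illustration):

```python
import unicodedata

# Normalize both sides to NFD (canonical decomposition + reordering)
# so that equivalent codepoint sequences actually match.

def nfd_equal(a: str, b: str) -> bool:
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def nfd_count(haystack: str, needle: str) -> int:
    return unicodedata.normalize("NFD", haystack).count(
        unicodedata.normalize("NFD", needle))

# Both spellings of "à" compare equal after normalization...
assert nfd_equal("\u00e0", "a\u0300")
# ...and counting finds both occurrences, whichever form each uses.
assert nfd_count("\u00e0 and a\u0300", "\u00e0") == 2
```

Without the normalization step, a naive substring search would find only the occurrence spelled the same way as the needle.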


denis
_________________
vita es estrany
spir.wikidot.com
