On 1/11/11 9:09 AM, spir wrote:
On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
On 1/11/11 4:41 AM, Michel Fortin wrote:
On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
<[email protected]> said:
In addition to these (and connecting the two), a VLERange would offer
two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed
to skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_
step that goes to the previous element.
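To make the two primitives concrete, here is a sketch in Python of what stepSize and backstepSize could compute over a UTF-8 buffer. The function names and signatures are illustrative only, not part of any actual D interface:

```python
def step_size(buf: bytes, offset: int) -> int:
    """Length in code units (bytes) of the code point starting at offset.
    Sketch of the proposed stepSize primitive, applied to UTF-8."""
    b = buf[offset]
    if b < 0x80:
        return 1               # ASCII: single byte
    elif b >> 5 == 0b110:
        return 2               # 2-byte sequence
    elif b >> 4 == 0b1110:
        return 3               # 3-byte sequence
    elif b >> 3 == 0b11110:
        return 4               # 4-byte sequence
    raise ValueError("offset is not at the start of a code point")

def backstep_size(buf: bytes, offset: int) -> int:
    """Length of the code point that ends just before offset.
    Sketch of the proposed backstepSize primitive."""
    n = 1
    # Continuation bytes have the form 10xxxxxx; walk back past them.
    while buf[offset - n] & 0xC0 == 0x80:
        n += 1
    return n

s = "aé€".encode("utf-8")      # 1-byte + 2-byte + 3-byte code points
```

Note that backstepSize needs no scan from the start: UTF-8 is self-synchronizing, so a handful of backward byte inspections suffice.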

I like the idea, but I'm not sure about this interface. What's the
result of stepSize if your range must create two elements from one
underlying unit? Perhaps in those cases the element type could be an
array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string and
converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
"oe" (one chararacter to two characters), in this case 'front' could
simply return "oe" (two characters) in one iteration, with stepSize
being the size of the "œ" code point. In the same conversion process,
encountering "e" followed by a combining "´" would return pre-combined
character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical
element is one or more representation units. My understanding is that
you are referring to a fractional number of representation units for one
logical element.

I think Michel is right. If I understand correctly, VLERange addresses
the low-level and rather simple issue of each code point being encoded
as a variable number of code units. Right?
If yes, then what is the advantage of VLERange? D already has
string/wstring/dstring, allowing one to work with the most advantageous
encoding for the given source data, with dstring abstracting away the
low-level encoding issues.

It's not about the data, it's about algorithms. Currently there are algorithms that ostensibly work for bidirectional ranges, but internally "cheat" by detecting that the input is actually a string, and use that knowledge for better implementations.

The benefit of VLERange is that it legitimizes those algorithms. I wouldn't be surprised if an entire class of algorithms would in fact require VLERange (e.g. many of those that we commonly consider today "string" algorithms).

The main (and massively ignored) issue when manipulating Unicode text is
rather that, unlike with legacy character sets, one code point does *not*
represent a character in the common sense. In character sets like Latin-1:
* each code represents a character, in the common sense (e.g. "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
All of this is wrong with Unicode. And these are complicated,
high-level issues that appear _after_ decoding, on code point sequences.
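The "single representation" point is easy to demonstrate; a short Python check, using the standard unicodedata module, shows that the same on-screen character can be one code point or two:

```python
import unicodedata

precomposed = "\u00e0"        # "à" as one code point (U+00E0)
decomposed = "a\u0300"        # "a" + combining grave accent, two code points

# Same character to a reader, different code point sequences:
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# Normalization (NFC composes, NFD decomposes) makes comparison meaningful:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is why naive code-point-level compare/find/count gives wrong answers on unnormalized text.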

If VLERange is helpful in dealing with those problems, then I don't
understand your presentation, sorry. Do you for instance mean such a
range would, under the hood, group together code points belonging to the
same character (thus making indexing meaningful), and/or normalise
(decompose & reorder) (thus allowing one to compare/find/count correctly)?

VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now. The difference is that VLERange being a formal concept, algorithms can specialize on it instead of (a) specializing for UTF strings or (b) specializing for BidirectionalRange and then manually detecting isSomeString inside. Conversely, when defining an algorithm you can specify VLERange as a requirement. Boyer-Moore is a perfect example - it doesn't work on bidirectional ranges, but it does work on VLERange. I suspect there are many like it.
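A sketch of that dual interface, written in Python over a UTF-8 buffer for illustration (the member names follow the D range vocabulary, but this class is not the actual D design): decoded access via front/back/popFront/popBack, raw code-unit access via indexing.

```python
class VLERange:
    """Illustrative VLE range over a UTF-8 byte buffer."""

    def __init__(self, buf: bytes):
        self.buf = buf

    @property
    def empty(self) -> bool:
        return not self.buf

    def _front_len(self) -> int:
        # Code units spanned by the first code point.
        b = self.buf[0]
        if b < 0x80:
            return 1
        return 2 if b >> 5 == 0b110 else 3 if b >> 4 == 0b1110 else 4

    def _back_len(self) -> int:
        # Code units spanned by the last code point: skip continuation bytes.
        n = 1
        while self.buf[-n] & 0xC0 == 0x80:
            n += 1
        return n

    @property
    def front(self) -> str:               # decoded first element
        return self.buf[:self._front_len()].decode("utf-8")

    @property
    def back(self) -> str:                # decoded last element
        return self.buf[-self._back_len():].decode("utf-8")

    def popFront(self):
        self.buf = self.buf[self._front_len():]

    def popBack(self):
        self.buf = self.buf[:-self._back_len()]

    def __getitem__(self, i):             # raw representation access
        return self.buf[i]
```

An algorithm specialized on this concept can iterate decoded elements while still, say, skipping ahead by raw code units - which is precisely what makes Boyer-Moore-style skips expressible.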

Of course, it would help a lot if we figured out other remarkable VLERanges. Here are a few that come to mind:

* Multibyte encodings other than UTF. Currently we have no special support for those beyond e.g. forward or bidirectional ranges.

* Huffman, RLE, LZ encoded buffers (and many other compressed formats)

* Vocabulary-based translation systems, e.g. associate each word with a number.

* Others...?

Some of these are forward-only (don't allow bidirectional access). Once we have a number of examples, it would be great to figure a number of remarkable algorithms operating on them.
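To illustrate why something like Huffman fits the concept, here is a toy sketch: with the prefix-free code 'a'->0, 'b'->10, 'c'->11 (a made-up table, bits modeled as a string for clarity), each element spans a variable number of code units (bits), and the "step" falls out of decoding. Note there is no cheap backstep here without scanning from the start, which is exactly the forward-only case mentioned above.

```python
# Tiny fixed prefix-free code, purely illustrative.
CODE = {"0": "a", "10": "b", "11": "c"}

def huffman_front_and_step(bits: str, offset: int):
    """Decode the element starting at offset in a bit string and report
    how many code units (bits) it spans - the stepSize idea applied to
    a Huffman-encoded stream."""
    acc = ""
    for i in range(offset, len(bits)):
        acc += bits[i]
        if acc in CODE:
            return CODE[acc], len(acc)
    raise ValueError("truncated code")
```

Walking "01011" front to back yields 'a' (step 1), 'b' (step 2), 'c' (step 2) - a forward-only VLE range.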


Andrei
