On 1/11/11 9:09 AM, spir wrote:
On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
On 1/11/11 4:41 AM, Michel Fortin wrote:
On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
<[email protected]> said:
In addition to these (and connecting the two), a VLERange would offer
two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed
to skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_
step that goes to the previous element.
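To make the two primitives concrete, here is a sketch in Python of what stepSize and backstepSize could compute over a UTF-8 buffer. The function names and signatures are illustrative only, not part of any actual D interface:

```python
def step_size(buf: bytes, offset: int) -> int:
    """Length in code units (bytes) of the code point starting at offset.
    Sketch of the proposed stepSize primitive, applied to UTF-8."""
    b = buf[offset]
    if b < 0x80:
        return 1               # ASCII: single byte
    elif b >> 5 == 0b110:
        return 2               # 2-byte sequence
    elif b >> 4 == 0b1110:
        return 3               # 3-byte sequence
    elif b >> 3 == 0b11110:
        return 4               # 4-byte sequence
    raise ValueError("offset is not at the start of a code point")

def backstep_size(buf: bytes, offset: int) -> int:
    """Length of the code point that ends just before offset.
    Sketch of the proposed backstepSize primitive."""
    n = 1
    # Continuation bytes have the form 10xxxxxx; walk back past them.
    while buf[offset - n] & 0xC0 == 0x80:
        n += 1
    return n

s = "aé€".encode("utf-8")      # 1-byte + 2-byte + 3-byte code points
```

Note that backstepSize needs no scan from the start: UTF-8 is self-synchronizing, so a handful of backward byte inspections suffice.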

I like the idea, but I'm not sure about this interface. What's the
result of stepSize if your range must create two elements from one
underlying unit? Perhaps in those cases the element type could be an
array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string and
converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
"oe" (one chararacter to two characters), in this case 'front' could
simply return "oe" (two characters) in one iteration, with stepSize
being the size of the "œ" code point. In the same conversion process,
encountering "e" followed by a combining "´" would return pre-combined
character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical
element is one or more representation units. My understanding is that
you are referring to a fractional number of representation units for one
logical element.

I think Michel is right. If I understand correctly, VLERange addresses
the low-level and rather simple issue of each code point being encoded
as a variable number of code units. Right?
If yes, then what is the advantage of VLERange? D already has
string/wstring/dstring, allowing one to work with the most advantageous
encoding for the given source data, with dstring abstracting away the
low-level encoding issues.

It's not about the data, it's about algorithms. Currently there are algorithms that ostensibly work for bidirectional ranges, but internally "cheat" by detecting that the input is actually a string, and use that knowledge for better implementations.

The benefit of VLERange is that it legitimizes those algorithms. I wouldn't be surprised if an entire class of algorithms would in fact require VLERange (e.g. many of those that we commonly consider today "string" algorithms).

The main (and massively ignored) issue when manipulating Unicode text is
rather that, unlike with legacy character sets, one code point does *not*
represent a character in the common sense. In character sets like Latin-1:
* each code represents a character, in the common sense (e.g. "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
All of this is wrong with Unicode. And these are complicated,
high-level issues that appear _after_ decoding, on code point sequences.
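The "single representation" point is easy to demonstrate; a short Python check, using the standard unicodedata module, shows that the same on-screen character can be one code point or two:

```python
import unicodedata

precomposed = "\u00e0"        # "à" as one code point (U+00E0)
decomposed = "a\u0300"        # "a" + combining grave accent, two code points

# Same character to a reader, different code point sequences:
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# Normalization (NFC composes, NFD decomposes) makes comparison meaningful:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is why naive code-point-level compare/find/count gives wrong answers on unnormalized text.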

If VLERange is helpful in dealing with those problems, then I don't
understand your presentation, sorry. Do you for instance mean such a
range would, under the hood, group together code points belonging to the
same character (thus making indexing meaningful), and/or normalise
(decompose & reorder) (thus allowing one to compare/find/count correctly)?

VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now. The difference is that VLERange being a formal concept, algorithms can specialize on it instead of (a) specializing for UTF strings or (b) specializing for BidirectionalRange and then manually detecting isSomeString inside. Conversely, when defining an algorithm you can specify VLERange as a requirement. Boyer-Moore is a perfect example - it doesn't work on bidirectional ranges, but it does work on VLERange. I suspect there are many like it.
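A sketch of that dual interface, written in Python over a UTF-8 buffer for illustration (the member names follow the D range vocabulary, but this class is not the actual D design): decoded access via front/back/popFront/popBack, raw code-unit access via indexing.

```python
class VLERange:
    """Illustrative VLE range over a UTF-8 byte buffer."""

    def __init__(self, buf: bytes):
        self.buf = buf

    @property
    def empty(self) -> bool:
        return not self.buf

    def _front_len(self) -> int:
        # Code units spanned by the first code point.
        b = self.buf[0]
        if b < 0x80:
            return 1
        return 2 if b >> 5 == 0b110 else 3 if b >> 4 == 0b1110 else 4

    def _back_len(self) -> int:
        # Code units spanned by the last code point: skip continuation bytes.
        n = 1
        while self.buf[-n] & 0xC0 == 0x80:
            n += 1
        return n

    @property
    def front(self) -> str:               # decoded first element
        return self.buf[:self._front_len()].decode("utf-8")

    @property
    def back(self) -> str:                # decoded last element
        return self.buf[-self._back_len():].decode("utf-8")

    def popFront(self):
        self.buf = self.buf[self._front_len():]

    def popBack(self):
        self.buf = self.buf[:-self._back_len()]

    def __getitem__(self, i):             # raw representation access
        return self.buf[i]
```

An algorithm specialized on this concept can iterate decoded elements while still, say, skipping ahead by raw code units - which is precisely what makes Boyer-Moore-style skips expressible.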

Of course, it would help a lot if we figured out other remarkable VLERanges. Here are a few that come to mind:

* Multibyte encodings other than UTF. Currently we have no special support for those beyond e.g. forward or bidirectional ranges.

* Huffman, RLE, LZ encoded buffers (and many other compressed formats)

* Vocabulary-based translation systems, e.g. associate each word with a number.

* Others...?

Some of these are forward-only (don't allow bidirectional access). Once we have a number of examples, it would be great to figure a number of remarkable algorithms operating on them.
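To illustrate why something like Huffman fits the concept, here is a toy sketch: with the prefix-free code 'a'->0, 'b'->10, 'c'->11 (a made-up table, bits modeled as a string for clarity), each element spans a variable number of code units (bits), and the "step" falls out of decoding. Note there is no cheap backstep here without scanning from the start, which is exactly the forward-only case mentioned above.

```python
# Tiny fixed prefix-free code, purely illustrative.
CODE = {"0": "a", "10": "b", "11": "c"}

def huffman_front_and_step(bits: str, offset: int):
    """Decode the element starting at offset in a bit string and report
    how many code units (bits) it spans - the stepSize idea applied to
    a Huffman-encoded stream."""
    acc = ""
    for i in range(offset, len(bits)):
        acc += bits[i]
        if acc in CODE:
            return CODE[acc], len(acc)
    raise ValueError("truncated code")
```

Walking "01011" front to back yields 'a' (step 1), 'b' (step 2), 'c' (step 2) - a forward-only VLE range.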


Andrei
