On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu
<[email protected]> said:
On 12/31/11 10:47 AM, Michel Fortin wrote:
It seems to work fine, but it doesn't
handle (yet) characters spanning multiple code points.
That's the job of std.range, not std.algorithm.
As I keep saying, if you handle combining code points at the range
level you'll have very inefficient code. But I think you get that.
To handle this
case, you could use a logical glyph range, but that'd be quite
inefficient. Or you can improve the algorithm working on code points so
that it checks for combining characters on the edges, but then is it
still a generic algorithm?
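To make the edge problem concrete, here is a minimal sketch. It assumes a current Phobos, where std.uni.byGrapheme exists (it was not available when this thread was written), and hasGlyph is a hypothetical helper written for this illustration, not a library function. A code-point search finds 'e' inside a decomposed "é", while a glyph-level comparison does not:

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : canFind;
import std.uni : byGrapheme;

/// Code-point-level search: what a generic algorithm over dchar sees.
bool hasCodePoint(string s, dchar c)
{
    return s.canFind(c);
}

/// Glyph-level search (hypothetical helper): compares whole grapheme clusters.
/// No normalization is performed, so `glyph` must use the same form as `s`.
bool hasGlyph(string s, string glyph)
{
    foreach (g; s.byGrapheme)
    {
        if (equal(g[], glyph))
            return true;
    }
    return false;
}

void main()
{
    string s = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    assert(hasCodePoint(s, 'e'));   // matches the base letter alone
    assert(!hasGlyph(s, "e"));      // the glyph is "é", not "e"
    assert(hasGlyph(s, "e\u0301"));
}
```

The glyph version walks every cluster, which is the inefficiency in question; the code-point version is cheaper but gives the wrong answer at the edge of the match.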
Second, it doesn't work efficiently. Sure you can specialize the
algorithm so it does not decode all code units when it's not necessary,
but then does it still classify as a generic algorithm?
My point is that *generic* algorithms cannot work *efficiently* with
Unicode, not that they can't work at all. And even then, for the
inefficient generic algorithm to work correctly with all input, the user
needs to choose the correct Unicode representation for the problem at
hand, which requires some general knowledge of Unicode.
Which is why I'd just discourage generic algorithms for strings.
I think you are in a position that is defensible, but not generous and
therefore undesirable. The military equivalent would be defending a
fortified landfill drained by a sewer. You don't _want_ to be there.
I don't get the analogy.
Taken to its ultimate conclusion, your argument is that we give up on
genericity for strings and go home.
That is more or less what I am saying. Genericity for strings leads to
inefficient algorithms, and you don't want inefficient algorithms, at
least not without being warned in advance. This is why for instance you
give a special name to inefficient (linear) operations in
std.container. In the same way, I think generic operations on strings
should be disallowed unless you opt in by explicitly saying on which
representation you want the algorithm to perform its task.
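As a rough illustration of what such an opt-in could look like, the sketch below assumes a current Phobos, where std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme are available (none of them is part of this discussion). The three count* helpers are hypothetical names made up for the example; each one states its representation explicitly:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Each helper names the representation it works on.
size_t countUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t countPoints(string s) { return s.byDchar.walkLength; }
size_t countGlyphs(string s) { return s.byGrapheme.walkLength; }

void main()
{
    string s = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    assert(s.countUnits == 3);  // 1 + 2 UTF-8 code units
    assert(s.countPoints == 2); // two code points
    assert(s.countGlyphs == 1); // one user-perceived character
}
```

The same input gives three different lengths, so "length of a string" has no single answer until the caller picks a level.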
This is the kind of "range" I'd use to create algorithms dealing with
Unicode properly:
struct UnicodeRange(U)
{
    U frontUnit() @property;
    dchar frontPoint() @property;
    immutable(U)[] frontGlyph() @property;

    void popFrontUnit();
    void popFrontPoint();
    void popFrontGlyph();

    ...
}
We already have most of that. For a string s, s[0] is frontUnit,
s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is
popFrontPoint. We only need to define the glyph routines.
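The mapping described above can be spelled out as free functions. This is only a sketch: the four function names are borrowed from the proposed UnicodeRange and defined here for illustration, they are not Phobos functions, and the import assumes a current Phobos where front and popFront live in std.range.primitives.

```d
import std.range.primitives : front, popFront;

// Hypothetical wrappers over operations that already exist for strings.
char  frontUnit(string s)  { return s[0]; }    // first UTF-8 code unit
dchar frontPoint(string s) { return s.front; } // first decoded code point

void popFrontUnit(ref string s)  { s = s[1 .. $]; }
void popFrontPoint(ref string s) { s.popFront(); }

void main()
{
    string s = "\u00E9a"; // 'é' is two UTF-8 code units: 0xC3 0xA9

    assert(s.frontUnit == 0xC3);
    assert(s.frontPoint == '\u00E9');

    auto byUnit = s;
    byUnit.popFrontUnit();   // drops one code unit, leaving 0xA9 and 'a'
    assert(byUnit.length == 2);

    auto byPoint = s;
    byPoint.popFrontPoint(); // drops the whole code point
    assert(byPoint == "a");
}
```

Note that after popFrontUnit the string starts in the middle of a sequence, so the caller has to know when it is safe to go back to decoding.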
Indeed. I came up with this concept when writing my XML parser: I defined
frontUnit and popFrontUnit and used them all over the place (in
conjunction with slicing). And I rarely needed to decode whole code
points using front and popFront.
But I think you'd be stopping short. You want generic variable-length
encoding, not the above.
Really? How'd that work?
Except for the glyph implementation, we're already there. You are
talking about existing capabilities!
The problem with .raw is that it creates a separate range for the units.
That's the best part about it.
Depends. It should create a *linked* range, not a *separate* one, in
the sense that if you advance the "raw" range with popFront, it should
advance the underlying "code point" range too.
This means you can't look at frontUnit, decide to pop the unit, look at
the next one, decide you need to decode using frontPoint, then call
popFrontPoint and return to looking at the front unit.
Of course you can.
while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();
        auto c = s.front;
        s.popFront();
    }
}
But will s.raw.popFront() also pop a single unit from s? "raw" would
need to be defined as a reinterpret cast of the reference to the char[]
to do what I want, something like this:
ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }
The current std.string.representation doesn't do that at all.
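Here is a small sketch of the difference. The raw function is the reinterpreting accessor from above, with a `return` annotation added on the parameter on the assumption that a recent compiler wants it for the ref return; the two lengthAfter* helpers are hypothetical names used only to observe the effect.

```d
import std.range.primitives : popFront;
import std.string : representation;

// Linked view: the same slice variable, reinterpreted as ubyte[].
ref ubyte[] raw(return ref char[] s) @system
{
    return *cast(ubyte[]*) &s;
}

// Popping through the linked view shortens s itself.
size_t lengthAfterLinkedPop(char[] s)
{
    s.raw.popFront();
    return s.length;
}

// representation returns a separate ubyte[] slice; popping it leaves s alone.
size_t lengthAfterSeparatePop(char[] s)
{
    auto r = s.representation;
    r.popFront();
    return s.length;
}

void main()
{
    char[] s = "\u00E9\u00E0".dup; // "éà": four UTF-8 code units

    assert(lengthAfterSeparatePop(s) == 4); // s is unchanged
    assert(lengthAfterLinkedPop(s) == 3);   // one unit was popped from s
}
```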
Also, how does it work with slicing? It can work with raw, but you'll
have to cast things everywhere because raw is a ubyte[]:
    string s = "éà";
    s = cast(typeof(s))s.raw[0..4];
Now that I wrote it I'm even more enthralled with the coolness of the
scheme. You essentially have access to two separate ranges on top of
the same fabric.
Glad you like the concept.
--
Michel Fortin
[email protected]
http://michelf.com/