Re: string is rarely useful as a function argument

Michel Fortin Sat, 31 Dec 2011 12:45:49 -0800

On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu <[email protected]> said:

On 12/31/11 10:47 AM, Michel Fortin wrote:

It seems to work fine, but it doesn't
handle (yet) characters spanning multiple code points.


That's the job of std.range, not std.algorithm.

As I keep saying, if you handle combining code points at the range level you'll have very inefficient code. But I think you get that.

To handle this
case, you could use a logical glyph range, but that'd be quite
inefficient. Or you can improve the algorithm working on code points so
that it checks for combining characters on the edges, but then is it
still a generic algorithm?

Second, it doesn't work efficiently. Sure you can specialize the
algorithm so it does not decode all code units when it's not necessary,
but then does it still classify as a generic algorithm?

My point is that *generic* algorithms cannot work *efficiently* with
Unicode, not that they can't work at all. And even then, for the
inefficient generic algorithm to work correctly with all input, the user
needs to choose the correct Unicode representation for the problem at
hand, which requires some general knowledge of Unicode.
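
To make the representation issue concrete, here is a minimal sketch in D. It assumes a current Phobos: std.uni.byGrapheme and std.uni.normalize were added after this thread and are not something the message itself refers to. It writes the same user-perceived character "é" once as a single code point (U+00E9) and once as "e" followed by the combining acute accent (U+0301), then looks at both through code units, code points and graphemes.

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    string precomposed = "\u00E9";   // é as a single code point
    string combining   = "e\u0301";  // e + COMBINING ACUTE ACCENT

    // Three views of the same text.
    assert(combining.length == 3);                 // UTF-8 code units
    assert(combining.walkLength == 2);             // code points
    assert(combining.byGrapheme.walkLength == 1);  // user-perceived characters

    // A generic code-point search finds a bare 'e' in the combining form...
    assert(combining.canFind('e'));
    // ...but not in the precomposed form of the same character.
    assert(!precomposed.canFind('e'));

    // The two spellings only compare equal after normalization.
    assert(precomposed != combining);
    assert(normalize!NFC(combining) == precomposed);
}
```

Which of these answers is "correct" depends on the problem, which is the general Unicode knowledge the paragraph above says the user needs.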

Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there.


I don't get the analogy.

Taken to its ultimate conclusion, your argument is that we give up on genericity for strings and go home.

That is more or less what I am saying. Genericity for strings leads to inefficient algorithms, and you don't want inefficient algorithms, at least not without being warned in advance. This is why, for instance, you give a special name to inefficient (linear) operations in std.container. In the same way, I think generic operations on strings should be disallowed unless you opt in by explicitly saying on which representation you want the algorithm to perform its task.

This is the kind of "range" I'd use to create algorithms dealing with
Unicode properly:

struct UnicodeRange(U)
{
    U frontUnit() @property;
    dchar frontPoint() @property;
    immutable(U)[] frontGlyph() @property;

    void popFrontUnit();
    void popFrontPoint();
    void popFrontGlyph();

    ...
}
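
As an illustration only, the interface above could be filled in for UTF-8 input roughly as follows. This is a sketch under assumptions, not the range the message proposes: Utf8Range and its empty property are hypothetical names, it is fixed to string rather than generic over U, and the glyph routines lean on std.uni.graphemeStride, which did not exist in Phobos when this was written.

```d
import std.uni : graphemeStride;
import std.utf : decode, stride;

// Hypothetical UTF-8-only take on the interface sketched above.
struct Utf8Range
{
    string data;

    @property bool empty() { return data.length == 0; }

    // First code unit, no decoding.
    @property char frontUnit() { return data[0]; }

    // First code point, decoded on demand.
    @property dchar frontPoint()
    {
        size_t index = 0;
        return decode(data, index);
    }

    // First grapheme cluster, returned as a slice of the input.
    @property string frontGlyph()
    {
        return data[0 .. graphemeStride(data, 0)];
    }

    void popFrontUnit()  { data = data[1 .. $]; }
    void popFrontPoint() { data = data[stride(data, 0) .. $]; }
    void popFrontGlyph() { data = data[graphemeStride(data, 0) .. $]; }
}

void main()
{
    auto r = Utf8Range("<e\u0301>");   // '<', e + combining acute, '>'
    assert(r.frontUnit == '<');
    r.popFrontUnit();                  // cheap: one code unit
    assert(r.frontPoint == 'e');       // code point view
    assert(r.frontGlyph == "e\u0301"); // grapheme view of the same position
    r.popFrontGlyph();
    assert(r.frontUnit == '>');
    r.popFrontPoint();
    assert(r.empty);
}
```

The point of the design is that all three levels share one position in the text, so an algorithm can move by units where that is enough and only decode where it has to.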

We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines.
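
The correspondence described here can be checked with a few lines of D. This is a minimal sketch that assumes Phobos' auto-decoding front and popFront for narrow strings (imported from std.range.primitives in current releases); the variable names are only for illustration.

```d
import std.range.primitives : front, popFront;

void main()
{
    string s = "\u00E9a";    // é (two UTF-8 code units) followed by 'a'

    char  unit  = s[0];      // "frontUnit": first code unit
    dchar point = s.front;   // "frontPoint": first decoded code point
    assert(unit == 0xC3);
    assert(point == '\u00E9');

    string byUnit = s;
    byUnit = byUnit[1 .. $]; // "popFrontUnit": drops one code unit
    assert(byUnit.length == 2);

    string byPoint = s;
    byPoint.popFront();      // "popFrontPoint": drops a whole code point
    assert(byPoint == "a");
}
```

Note that after the unit-level pop, byUnit starts in the middle of an encoded code point, so mixing the two levels is only safe when the code knows which units it is skipping.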

Indeed. I came up with this concept when writing my XML parser: I defined frontUnit and popFrontUnit and used them all over the place (in conjunction with slicing). And I rarely needed to decode whole code points using front and popFront.
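
A small sketch of why unit-level scanning is enough for this kind of parsing. It is not taken from the XML parser mentioned above; textBeforeTag is a hypothetical helper. It relies on a property of UTF-8: every code unit of a multi-byte sequence is 0x80 or above, so an ASCII delimiter such as '<' can never appear inside an encoded non-ASCII character.

```d
// Hypothetical helper: returns the text before the first '<',
// looking at code units only and never decoding a code point.
string textBeforeTag(string s)
{
    size_t i = 0;
    while (i < s.length && s[i] != '<')
        ++i;                 // advance one code unit at a time
    return s[0 .. i];        // slicing keeps the bytes intact
}

void main()
{
    // The é is skipped as two opaque code units; the slice is still valid UTF-8.
    assert(textBeforeTag("caf\u00E9 <b>") == "caf\u00E9 ");
    assert(textBeforeTag("no tag here") == "no tag here");
}
```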

But I think you'd be stopping short. You want generic variable-length encoding, not the above.


Really? How'd that work?

Except for the glyphs implementation, we're already there. You are talking about existing capabilities!

The problem with .raw is that it creates a separate range for the units.

That's the best part about it.

Depends. It should create a *linked* range, not a *separate* one, in the sense that if you advance the "raw" range with popFront, it should advance the underlying "code point" range too.

This means you can't look at the frontUnit and then decide to pop the
unit and then look at the next, decide you need to decode using
frontPoint, then call popFrontPoint and return to looking at the front unit.


Of course you can.

while (condition) {
   if (s.raw.front == someFrontUnitThatICareAbout) {
      s.raw.popFront();
      auto c = s.front;
      s.popFront();
   }
}

But will s.raw.popFront() also pop a single unit from s? "raw" would need to be defined as a reinterpret cast of the reference to the char[] to do what I want, something like this:


        ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

The current std.string.representation doesn't do that at all.
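
The difference between the two behaviours can be sketched as follows. The raw function is the reinterpret cast from above, with a `return` annotation added on the parameter as an assumption about what current compilers expect for a returned reference; the rest only contrasts it with std.string.representation.

```d
import std.range.primitives : popFront;
import std.string : representation;

// Linked view: the returned ubyte[] is the very same slice variable as s.
ref ubyte[] raw(return ref char[] s) { return *cast(ubyte[]*)&s; }

void main()
{
    char[] s = "\u00E9a".dup;   // 3 code units
    s.raw.popFront();           // pops one code unit from s itself
    assert(s.length == 2);

    char[] t = "\u00E9a".dup;
    auto r = t.representation;  // separate slice over the same memory
    r.popFront();               // advances r only
    assert(r.length == 2);
    assert(t.length == 3);      // t did not move
}
```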

Also, how does it work with slicing? It can work with raw, but you'll have to cast things everywhere because raw is a ubyte[]:


        string s = "éà";
        s = cast(typeof(s))s.raw[0..4];

Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric.


Glad you like the concept.


--
Michel Fortin
[email protected]
http://michelf.com/
