The grand STL tradition is to _always_ take iterators by value. If iterators are computed, they are returned by the algorithm.

My initial approach to defining std.algorithm was to continue that tradition for ranges: ranges are values. No algorithm in std.algorithm currently takes a range by reference; a couple of simple functions in std.range, namely popFrontN and popBackN, emphatically do take ref.
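
To make the contrast concrete, here is a small sketch of the two conventions using two existing Phobos functions: find in std.algorithm (value in, value out) and popFrontN in std.range (takes ref). The wrapper function is mine, purely for illustration:

import std.algorithm : find;
import std.range : popFrontN;

void example() {
    auto r = [1, 2, 3, 4, 5];

    // Value convention: find takes r by value and returns the computed
    // subrange; r itself only moves because we assign the result back.
    r = find(r, 3);    // r is now [3, 4, 5]

    // Ref convention: popFrontN mutates r in place.
    r.popFrontN(1);    // r is now [4, 5]
}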

It is becoming increasingly clear, however, that there are algorithms that could and should manipulate ranges by reference. They could take and return values instead, but that quickly gets messy. (Cue music for the "improve built-in tuples" choir.)
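
For a taste of what the value-returning alternative would look like, here is a hypothetical skipValue (the name and the named tuple fields are mine, purely for illustration) that does the same job as the skip function defined further below, but makes the caller unpack a tuple and reassign:

import std.range.primitives;
import std.typecons : Tuple;

// Hypothetical value-based variant of skip: instead of mutating its
// argument, it returns both the success flag and the advanced range.
Tuple!(bool, "matched", R1, "rest") skipValue(R1, R2)(R1 r1, R2 r2) {
    auto r = r1; // .save();
    while (!r2.empty && !r.empty && r.front == r2.front) {
        r.popFront();
        r2.popFront();
    }
    return r2.empty
        ? typeof(return)(true, r)    // matched: hand back the advanced range
        : typeof(return)(false, r1); // no match: hand back the original
}

void usage(ref string txt) {
    auto res = skipValue(txt, "!--");
    if (res.matched)
        txt = res.rest; // extra bookkeeping at every call site
}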

A concrete case is text processing. Many contemporary libraries use index-based processing, but that gets awkward with variable-length (multibyte) characters. To address that, bidirectional ranges are one correct way to go. Phobos defines a byDchar range that spans a string correctly, one dchar (one code point) at a time. With that, you can write:

auto fname = "some-utf8-file.html";
auto txt = byDchar(readText(fname));
// use txt.empty, txt.front, txt.back, txt.popFront, txt.popBack

The front and back methods have type dchar.

So far so good. But then I noticed that many std.algorithm functions make it awkward to manipulate the range comfortably. For example, say I want to match a tag and special-case a few particular tags:

while (!txt.empty) {
    auto c = txt.front;
    txt.popFront();
    if (c == '<') {
        if (startsWith(txt, "!--")) {
            // This is a comment
            popFrontN(txt, 3);
            txt = find(txt, "-->");
            popFrontN(txt, 3);
            ...
        } else if (startsWith(txt, "script")) {
            ...
        }
    }
    ...
}

This is already wasteful: startsWith scans a few characters off txt, and then popFrontN scans the same characters again. In this particular case the cost is negligible, but the approach doesn't have elegance on its side either, so it's a lose-lose.

There's also the case-sensitivity issue with e.g. the "script" tag, but I hope the approach outlined in the post "one step towards unification of std.algorithm and std.string" will take care of that.

Looking at samples like the above, some of the needed algorithms would look like this (simplified by e.g. giving up custom predicates):

/**
If r1 starts with r2, advance r1 past it and return true. Otherwise, leave r1 unchanged and return false.
*/
bool skip(R1, R2)(ref R1 r1, R2 r2) {
    auto r = r1; // .save(); copy so r1 stays untouched on failure
    while (!r2.empty && !r.empty && r.front == r2.front) {
        r.popFront();
        r2.popFront();
    }
    if (r2.empty) {
        r1 = r;
        return true;
    }
    return false;
}

/**
If r2 can be found in r1, advance r1 past the first occurrence of r2 and return true. Otherwise, leave r1 unchanged and return false.
*/
bool findSkip(R1, R2)(ref R1 r1, R2 r2) {
    auto r = r1; // .save(); copy so r1 stays untouched on failure
    while (!r.empty && !startsWith(r, r2)) {
        r.popFront();
    }
    if (!r.empty) {
        // r now starts with r2; skip past it and commit the position.
        skip(r, r2);
        r1 = r;
        return true;
    }
    return false;
}
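
As a quick sanity check of the intended semantics, a unittest sketch (it assumes the two functions above plus the usual imports, i.e. std.algorithm for startsWith and std.range for the string range primitives):

unittest {
    auto s = "<!-- note -->rest";
    assert(skip(s, "<!--"));     // s advanced past the prefix
    assert(!skip(s, "xyz"));     // no match: s is left unchanged
    assert(findSkip(s, "-->"));  // s advanced past the first "-->"
    assert(s == "rest");
}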

With such functions the code looks much cleaner:

while (!txt.empty) {
    auto c = txt.front;
    txt.popFront();
    if (c == '<') {
        if (skip(txt, "!--")) {
            // This is a comment
            enforce(findSkip(txt, "-->"));
            ...
        } else if (skip(txt, "script")) {
            ...
        }
    }
    ...
}

But they break the tradition: now an algorithm may or may not alter a range, and client code must be aware of that, which is one more thing to worry about.

What do you think? Should we go with by-ref passing or not? Other ideas?



Andrei
