The grand STL tradition is to _always_ take iterators by value. If iterators are computed, they are returned by the algorithm.

My initial approach to defining std.algorithm was to continue that tradition for ranges: ranges are values. No algorithm in std.algorithm currently takes a range by reference; a couple of simple functions in std.range, namely popFrontN and popBackN, emphatically do take ref.
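
To make the contrast concrete, here is a small sketch of the two conventions using two existing Phobos functions: find in std.algorithm (value in, value out) and popFrontN in std.range (takes ref). The wrapper function is mine, purely for illustration:

import std.algorithm : find;
import std.range : popFrontN;

void example() {
    auto r = [1, 2, 3, 4, 5];

    // Value convention: find takes r by value and returns the computed
    // subrange; r itself only moves because we assign the result back.
    r = find(r, 3);    // r is now [3, 4, 5]

    // Ref convention: popFrontN mutates r in place.
    r.popFrontN(1);    // r is now [4, 5]
}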

It is becoming increasingly clear, however, that there are algorithms that could and should manipulate ranges by reference. They could take and return values instead, but that quickly gets messy. (Cue music for the "improve built-in tuples" choir.)
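
For a taste of what the value-returning alternative would look like, here is a hypothetical skipValue (the name and the named tuple fields are mine, purely for illustration) that does the same job as the skip function defined further below, but makes the caller unpack a tuple and reassign:

import std.range.primitives;
import std.typecons : Tuple;

// Hypothetical value-based variant of skip: instead of mutating its
// argument, it returns both the success flag and the advanced range.
Tuple!(bool, "matched", R1, "rest") skipValue(R1, R2)(R1 r1, R2 r2) {
    auto r = r1; // .save();
    while (!r2.empty && !r.empty && r.front == r2.front) {
        r.popFront();
        r2.popFront();
    }
    return r2.empty
        ? typeof(return)(true, r)    // matched: hand back the advanced range
        : typeof(return)(false, r1); // no match: hand back the original
}

void usage(ref string txt) {
    auto res = skipValue(txt, "!--");
    if (res.matched)
        txt = res.rest; // extra bookkeeping at every call site
}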

A concrete case is text processing. Many contemporary libraries use index-based processing, but that gets awkward with variable-length (multibyte) characters. To address that, bidirectional ranges are one correct way to go. Phobos defines a byDchar range that spans a string correctly, one dchar (one code point) at a time. With that, you can write:

auto fname = "some-utf8-file.html";
auto txt = byDchar(readText(fname));
// use txt.empty, txt.front, txt.back, txt.popFront, txt.popBack

The front and back methods have type dchar.

So far so good. But then I noticed that many std.algorithm functions make it awkward to manipulate the range comfortably. For example, say I want to match a tag and special-case a few particular tags:

while (!txt.empty) {
    auto c = txt.front;
    txt.popFront();
    if (c == '<') {
        if (startsWith(txt, "!--")) {
            // This is a comment
            popFrontN(txt, 3);
            txt = find(txt, "-->");
            popFrontN(txt, 3);
            ...
        } else if (startsWith(txt, "script")) {
            ...
        }
    }
    ...
}

This is already wasteful: startsWith scans a few characters off txt, and then popFrontN scans the same characters again. In this particular case the cost is negligible, but the approach doesn't have elegance on its side either, so it's a lose-lose.

There's also the case-sensitivity issue with e.g. the "script" tag, but I hope the approach outlined in the post "one step towards unification of std.algorithm and std.string" will take care of that.

Looking at samples like the above, some of the needed algorithms would look like this (simplified by e.g. giving up custom predicates):

/**
If r1 starts with r2, advance r1 past it and return true. Otherwise, leave r1 unchanged and return false.
*/
bool skip(R1, R2)(ref R1 r1, R2 r2) {
    auto r = r1; // .save(); copy so r1 stays untouched on failure
    while (!r2.empty && !r.empty && r.front == r2.front) {
        r.popFront();
        r2.popFront();
    }
    if (r2.empty) {
        r1 = r;
        return true;
    }
    return false;
}

/**
If r2 can be found in r1, advance r1 past the first occurrence of r2 and return true. Otherwise, leave r1 unchanged and return false.
*/
bool findSkip(R1, R2)(ref R1 r1, R2 r2) {
    auto r = r1; // .save(); copy so r1 stays untouched on failure
    while (!r.empty && !startsWith(r, r2)) {
        r.popFront();
    }
    if (!r.empty) {
        // r now starts with r2; skip past it and commit the position.
        skip(r, r2);
        r1 = r;
        return true;
    }
    return false;
}
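
As a quick sanity check of the intended semantics, a unittest sketch (it assumes the two functions above plus the usual imports, i.e. std.algorithm for startsWith and std.range for the string range primitives):

unittest {
    auto s = "<!-- note -->rest";
    assert(skip(s, "<!--"));     // s advanced past the prefix
    assert(!skip(s, "xyz"));     // no match: s is left unchanged
    assert(findSkip(s, "-->"));  // s advanced past the first "-->"
    assert(s == "rest");
}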

With such functions the code looks much cleaner:

while (!txt.empty) {
    auto c = txt.front;
    txt.popFront();
    if (c == '<') {
        if (skip(txt, "!--")) {
            // This is a comment
            enforce(findSkip(txt, "-->"));
            ...
        } else if (skip(txt, "script")) {
            ...
        }
    }
    ...
}

But they break the tradition: now an algorithm may or may not alter a range, and client code must be aware of that, which is one more thing to worry about.

What do you think? Should we go with by-ref passing or not? Other ideas?



Andrei
