On 1/1/2011 6:07 PM, Andrei Alexandrescu wrote:
* parallel is templated on range, but not on operation. Does this affect
speed for brief operations (such as the one given in the example,
squares[i] = i * i)? I wonder if using an alias wouldn't be more
appropriate. Some performance numbers would be very useful in any case.

OK, I ran the benchmarks. Since map is templated on the operation, I used it as a stand-in for the templated-on-operation scenario. Here's the benchmark:

import std.parallelism, std.stdio, std.datetime, std.range, std.conv,
    std.math;

int fun1(int num) {
    return roundTo!int(sqrt(num));
}

int fun2(int num) {
    return num * num;
}

// Selects the loop body under test; switch to fun1 for the costlier body.
alias fun2 fun;

void main() {
    auto foo = array(iota(10_000_000));
    auto bar = new int[foo.length];

    enum workUnitSize = 1_000_000;

    auto sw = StopWatch(autoStart);
    foreach(i, elem; parallel(foo, workUnitSize)) {
        bar[i] = fun(elem);
    }
    writeln("Parallel Foreach:  ", sw.peek.milliseconds);

    sw = StopWatch(autoStart);
    bar = taskPool.map!fun(foo, workUnitSize, bar);
    writeln("Map:  ", sw.peek.milliseconds);

    sw = StopWatch(autoStart);
    foreach(i, elem; foo) {
        bar[i] = fun(elem);
    }
    writeln("Serial:  ", sw.peek.milliseconds);
}


Results (in milliseconds):

Parallel Foreach:  69.2988
Map:  29.1973
Serial:  40.2884


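So there's clearly a large penalty for the delegate-based parallel foreach when the loop body is super cheap: it's about 2.4x slower than the templated map, and slower even than the serial loop.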

On the other hand, when I make fun1 the loop body instead (and it's still a fairly cheap body), the differences are buried in noise.

Now that I've given my honest report of the facts, though, I'd like to say that even so, I'm in favor of leaving things as-is, for the following reasons:

1. Super cheap loop bodies are usually not worth parallelizing anyhow. You get nowhere near a linear speedup due to memory bandwidth issues, etc., and if some super cheap loop body is your main bottleneck, it's probably being executed in some outer loop, and it may make more sense to parallelize the outer loop. In all my experience with std.parallelism, I've **never** had the need/desire to resort to parallelism fine-grained enough that the limitations of delegate-based parallel foreach mattered in practice.

2. If you really want to parallelize super cheap loop bodies, map() isn't going anywhere. It and/or reduce(), which is also templated on the operation, will usually do what you need. You can even use parallel map in place by simply passing the same (writable) range as both the input and the buffer.

3. The foreach syntax makes the following very useful things (as in, I actually use them regularly) possible that wouldn't be possible if we used templates:

foreach(index, elem; parallel(range))
foreach(ref elem; parallel(range))

It also just plain looks nice.

4. A major point of parallel foreach is that variables in the outer scope "just work". When blocks of code are passed as aliases instead of delegates, access to outer-scope variables is still very buggy.

5. I'm hoping I can convince Walter to implement an alias-based version of opApply, which is half-implemented and commented out in the DMD source code. If this were implemented, I'd change std.parallelism to use it and this whole discussion would be moot.
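
To make points 2 through 4 concrete, here is a minimal sketch rather than code from the benchmark. It assumes the std.parallelism that ships with current Phobos, where the eager parallel map is spelled taskPool.amap (the post calls it map), and it assumes amap reads each element before writing to the same index of the buffer. The helper names squaresByForeach, scaleByForeach and squareInPlace are hypothetical and exist only for illustration.

```d
import std.parallelism : parallel, taskPool;
import std.stdio : writeln;

// Hypothetical cheap operation, same shape as fun2 in the benchmark.
int square(int num) {
    return num * num;
}

// Point 3: index plus ref element in a parallel foreach.
// Each iteration writes only its own element, so there is no data race.
int[] squaresByForeach(size_t n) {
    auto result = new int[n];
    foreach (i, ref elem; parallel(result)) {
        elem = cast(int)(i * i);
    }
    return result;
}

// Point 4: the loop body is a delegate, so variables from the enclosing
// scope (scale and result) are usable without any extra plumbing.
int[] scaleByForeach(int[] input, int scale) {
    auto result = new int[input.length];
    foreach (i, elem; parallel(input)) {
        result[i] = elem * scale;
    }
    return result;
}

// Point 2: templated parallel map, done in place by passing the same
// writable array as both the input and the output buffer.
// Assumption: amap is the current name of the eager parallel map.
void squareInPlace(int[] data) {
    taskPool.amap!square(data, data);
}

void main() {
    writeln(squaresByForeach(5));
    writeln(scaleByForeach([1, 2, 3], 3));

    auto data = [1, 2, 3, 4];
    squareInPlace(data);
    writeln(data);
}
```

The two foreach helpers pay the delegate-call cost per element that the benchmark measures, while squareInPlace gets the operation as a template argument and so can have it inlined.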
