I've looked into this more. It turns out I can only reproduce it when running Linux in a VM on top of Windows. When I reboot and run my Linux distro on bare metal instead, I get decent (though not linear) speedups on the matrix benchmark. My guess is that locking and context switches are less efficient/more expensive in a VM than on bare metal. In your case, having two physical CPUs in separate sockets probably makes the atomic operations required for locking, context switches, etc. more expensive. After some fiddling around, the GC appears to be a non-issue.

Since only the inner loop, not the outer loop, is easily parallelizable, I think a 256x256 matrix is at the very edge of what's feasible in terms of granularity. Each iteration of the outer loop takes only on the order of half a millisecond in serial. (Your ~60 ms serial runs below, divided by 256 outer iterations, come out to roughly 0.23 ms each, the same order of magnitude.) In other words, we're trying to parallelize an inner loop whose entire run, start to finish, costs on the order of half a CPU-millisecond, not half a millisecond per iteration. At that scale, slight changes in the costs of various primitives (or having more cores contending for locks, triggering context switches, etc.) can have a huge effect. I've switched to a 1024x1024 matrix instead, although at that size the benchmark appears to be somewhat memory bandwidth-bound.
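
For concreteness, here's a minimal sketch of the loop structure I'm describing. This is not the actual benchmark code; it assumes a Gauss-Jordan style row reduction (reduce and mat are made-up names) and only illustrates which loop gets parallelized:

    import std.parallelism, std.range;

    // Sketch only: row reduction with a serial outer loop over pivots.
    // Each pivot step depends on the previous one, so only the inner
    // loop over rows can be farmed out to the task pool.
    void reduce(double[][] mat)
    {
        immutable n = mat.length;

        foreach (pivot; 0 .. n)
        {
            immutable pivotVal = mat[pivot][pivot];
            mat[pivot][] /= pivotVal;  // Normalize the pivot row.

            // Parallel inner loop: each row reduction is independent.
            foreach (row; parallel(iota(n)))
            {
                if (row == pivot) continue;
                immutable factor = mat[row][pivot];
                mat[row][] -= mat[pivot][] * factor;
            }
        }
    }

Note that every pass through the outer loop implies a full synchronization point at the end of the parallel foreach, 256 times over for a 256x256 matrix, which is exactly where the per-task overhead bites.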

As a general statement, these benchmarks are much more fine-grained than what I use std.parallelism for in the real world, both because fine-grained examples were the only simple, non-domain-specific, dependency-free ones I could think of, and because I wanted to show that std.parallelism works reasonably well (though certainly not perfectly) even with fairly fine-grained parallelism. The unfortunate reality, though, is that this kind of micro-parallelism is hard to implement efficiently and will probably always (in every lib, not just mine) have performance characteristics that are highly dependent on hardware, OS primitives, etc., and that require some tuning (see the sketch below). This isn't to say that std.parallelism is the best micro-parallelism lib out there, just that I highly doubt efficient general-case micro-parallelism is a solved problem, or even practically solvable, and these benchmarks illustrate a far-from-ideal case.
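
To make "some tuning" concrete: the main knob std.parallelism exposes here is the work unit size argument to parallel(). A hedged sketch, with a made-up workload function (someWork is purely illustrative, not from the benchmarks):

    import std.math, std.parallelism, std.range;

    // Hypothetical stand-in for per-iteration work.
    double someWork(size_t i)
    {
        return sqrt(cast(double) i);
    }

    void main()
    {
        enum size_t n = 1_024;
        auto results = new double[n];

        // Work unit size of 64: each task grabs 64 iterations at a time.
        // Too small and the task queue itself becomes the bottleneck on
        // fine-grained loops; too large and cores sit idle near the end
        // of the loop.  The right value depends on hardware, OS primitive
        // costs, etc., which is the tuning problem described above.
        foreach (i; parallel(iota(n), 64))
            results[i] = someWork(i);
    }

The 64 here is illustrative, not a recommendation; on any given machine the sweet spot has to be found empirically.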

On 2/27/2011 1:44 PM, Russel Winder wrote:
David,

On Sun, 2011-02-27 at 09:48 -0500, dsimcha wrote:
[ . . . ]
Can you please re-run the benchmark to make sure that this isn't just a
one-time anomaly?  I can't seem to make the parallel matrix inversion
run slower than serial on my hardware, even with ridiculous tuning
parameters that I was almost sure would bottleneck the thing on the task
queue.  Also, all the other benchmarks actually look pretty good.

Sadly the result is consistent :-(

         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 60 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 61 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 65 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 59 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>