I've looked into this more. It turns out I can only reproduce it when running Linux in a VM on top of Windows. When I reboot and run my Linux distro on bare metal instead, I get decent (though not linear) speedups on the matrix benchmark. My guess is that locking and context switches are less efficient/more expensive in a VM than on bare metal. In your case, having two physical CPUs in separate sockets probably makes the atomic operations required for locking, context switches, etc. more expensive. After some fiddling around, the GC appears to be a non-issue.

Since only the inner loop, not the outer loop, is easily parallelizable, I think a 256x256 matrix is at the very edge of what's feasible in terms of granularity. Each iteration of the outer loop takes only on the order of half a millisecond in serial. (Your ~60 ms serial runs below, divided by 256 outer iterations, come out to roughly 0.23 ms each, the same order of magnitude.) In other words, we're trying to parallelize an inner loop whose entire run, start to finish, costs on the order of half a CPU-millisecond, not half a millisecond per iteration. At that scale, slight changes in the costs of various primitives (or having more cores contending for locks, triggering context switches, etc.) can have a huge effect. I've switched to a 1024x1024 matrix instead, although at that size the benchmark appears to be somewhat memory bandwidth-bound.
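
For concreteness, here's a minimal sketch of the loop structure I'm describing. This is not the actual benchmark code; it assumes a Gauss-Jordan style row reduction (reduce and mat are made-up names) and only illustrates which loop gets parallelized:

    import std.parallelism, std.range;

    // Sketch only: row reduction with a serial outer loop over pivots.
    // Each pivot step depends on the previous one, so only the inner
    // loop over rows can be farmed out to the task pool.
    void reduce(double[][] mat)
    {
        immutable n = mat.length;

        foreach (pivot; 0 .. n)
        {
            immutable pivotVal = mat[pivot][pivot];
            mat[pivot][] /= pivotVal;  // Normalize the pivot row.

            // Parallel inner loop: each row reduction is independent.
            foreach (row; parallel(iota(n)))
            {
                if (row == pivot) continue;
                immutable factor = mat[row][pivot];
                mat[row][] -= mat[pivot][] * factor;
            }
        }
    }

Note that every pass through the outer loop implies a full synchronization point at the end of the parallel foreach, 256 times over for a 256x256 matrix, which is exactly where the per-task overhead bites.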

As a general statement, these benchmarks are much more fine-grained than what I use std.parallelism for in the real world, both because fine-grained examples were the only simple, non-domain-specific, dependency-free ones I could think of, and because I wanted to show that std.parallelism works reasonably well (though certainly not perfectly) even with fairly fine-grained parallelism. The unfortunate reality, though, is that this kind of micro-parallelism is hard to implement efficiently and will probably always (in every lib, not just mine) have performance characteristics that are highly dependent on hardware, OS primitives, etc., and that require some tuning (see the sketch below). This isn't to say that std.parallelism is the best micro-parallelism lib out there, just that I highly doubt efficient general-case micro-parallelism is a solved problem, or even practically solvable, and these benchmarks illustrate a far-from-ideal case.
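
To make "some tuning" concrete: the main knob std.parallelism exposes here is the work unit size argument to parallel(). A hedged sketch, with a made-up workload function (someWork is purely illustrative, not from the benchmarks):

    import std.math, std.parallelism, std.range;

    // Hypothetical stand-in for per-iteration work.
    double someWork(size_t i)
    {
        return sqrt(cast(double) i);
    }

    void main()
    {
        enum size_t n = 1_024;
        auto results = new double[n];

        // Work unit size of 64: each task grabs 64 iterations at a time.
        // Too small and the task queue itself becomes the bottleneck on
        // fine-grained loops; too large and cores sit idle near the end
        // of the loop.  The right value depends on hardware, OS primitive
        // costs, etc., which is the tuning problem described above.
        foreach (i; parallel(iota(n), 64))
            results[i] = someWork(i);
    }

The 64 here is illustrative, not a recommendation; on any given machine the sweet spot has to be found empirically.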

On 2/27/2011 1:44 PM, Russel Winder wrote:
David,

On Sun, 2011-02-27 at 09:48 -0500, dsimcha wrote:
[ . . . ]
Can you please re-run the benchmark to make sure that this isn't just a
one-time anomaly?  I can't seem to make the parallel matrix inversion
run slower than serial on my hardware, even with ridiculous tuning
parameters that I was almost sure would bottleneck the thing on the task
queue.  Also, all the other benchmarks actually look pretty good.

Sadly the result is consistent :-(

         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 60 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 61 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 65 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 59 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>  matrixInversion
         Inverted a 256 x 256 matrix serially in 58 milliseconds.
         Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
         506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
         |>