On 2/27/2011 8:03 AM, Russel Winder wrote:
32-bit mode on an 8-core (twin Xeon) Linux box.  That core.cpuid bug
really, really sucks.

I see matrix inversion takes longer with 4 cores than with 1!

Can you please re-run the benchmark to make sure that this isn't just a one-time anomaly? I can't seem to make the parallel matrix inversion run slower than serial on my hardware, even with ridiculous tuning parameters that I was almost sure would bottleneck the thing on the task queue. Also, all the other benchmarks actually look pretty good.

It's possible that machines with multiple physical CPUs are much more likely to bottleneck on the task queue, because synchronized blocks cost a few more clock cycles on them. It's also possible that stack alignment issues are creeping in somewhere I hadn't anticipated, or that using four cores instead of two on a fairly fine-grained benchmark is enough to bottleneck on the queue (though I doubt this, because this benchmark worked well for others with quad-core machines).
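
To make the task-queue hypothesis a bit more concrete, here is a rough sketch in Python of how work-unit size controls the number of trips through a shared, locked queue. It is only an illustration under my own assumptions: `run_with_work_units`, `work_unit_size` and `n_workers` are made-up names, the "work" is just summing integers, and none of this is the library's actual implementation. It counts synchronized queue operations rather than measuring time, so it shows the mechanism, not the slowdown itself.

```python
import queue
import threading


def run_with_work_units(n_items, work_unit_size, n_workers=4):
    """Sum 0..n_items-1 in chunks pulled from a shared queue.

    Returns (total, pops), where pops is the number of tasks that
    went through the synchronized queue. Hypothetical helper, for
    illustration only.
    """
    tasks = queue.Queue()
    # One queue entry per work unit: smaller units mean more entries.
    for start in range(0, n_items, work_unit_size):
        tasks.put((start, min(start + work_unit_size, n_items)))

    lock = threading.Lock()
    stats = {"total": 0, "pops": 0}

    def worker():
        while True:
            try:
                # Every pop is a synchronized operation on the queue.
                start, stop = tasks.get_nowait()
            except queue.Empty:
                return
            partial = sum(range(start, stop))  # the actual "work"
            with lock:
                stats["total"] += partial
                stats["pops"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["total"], stats["pops"]
```

With 1000 items, a work unit of 1 sends 1000 tasks through the queue, while a work unit of 250 sends only 4, for the same result. If each synchronized pop costs more on a multi-socket box, the fine-grained case is where that cost would add up.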

