On 2/27/2011 9:48 AM, dsimcha wrote:
On 2/27/2011 8:03 AM, Russel Winder wrote:
32-bit mode on an 8-core (twin Xeon) Linux box. That core.cpuid bug
really, really sucks.

I see matrix inversion takes longer with 4 cores than with 1!


Actually, I can reproduce this, but only on Linux, and I think I've figured out why: it's related to my Posix workaround for Bug 3753 (http://d.puremagic.com/issues/show_bug.cgi?id=3753). The workaround causes a GC heap allocation on every call to parallel(), and the matrix inversion routine calls parallel() in a loop, so that's 256 allocations over the course of the benchmark. It was intended as a very quick and dirty workaround for a DMD bug that I thought would have been fixed a long time ago. It also seemed good enough at the time because I was using this lib for very coarse-grained parallelism, where the effect is negligible.
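
As a rough illustration of the call pattern described above, here is a minimal sketch. It is not the benchmark's actual code: the function eliminate() and its elimination loop are hypothetical, and it assumes the std.parallelism API (parallel() over an iota range). The point is only that parallel() is called once per pivot step, so any per-call cost is paid once per pivot (256 times in the benchmark, per the message).

```d
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical stand-in for the benchmark's inner routine: a Gauss-Jordan
// style elimination over a square matrix, one parallel foreach per pivot.
void eliminate(double[][] m)
{
    immutable n = m.length;
    foreach (pivot; 0 .. n)   // sequential outer loop: one parallel() call per pivot
    {
        // Under the Posix workaround, each parallel() call here would
        // trigger one GC allocation in place of alloca().
        foreach (row; parallel(iota(n)))
        {
            if (row == pivot || m[pivot][pivot] == 0)
                continue;
            // Each task writes only its own row and reads the pivot row,
            // which no task modifies during this step.
            immutable f = m[row][pivot] / m[pivot][pivot];
            foreach (col; 0 .. n)
                m[row][col] -= f * m[pivot][col];
        }
    }
}
```

With fine-grained work like this, the fixed cost of each parallel() call is a much larger share of the total than in coarse-grained use.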

Originally, I was using alloca() all over the place to efficiently deal with memory management. However, under Posix, I ran into Bug 3753 a long time ago and put in the following workaround, which simply forwards alloca() calls to the GC. From near the top of parallelism.d:

// Workaround for bug 3753.
version(Posix) {
    // Can't use alloca() because it can't be used with exception
    // handling.
    // Use the GC instead even though it's slightly less efficient.
    void* alloca(size_t nBytes) {
        return GC.malloc(nBytes);
    }
} else {
    // Can really use alloca().
    import core.stdc.stdlib : alloca;
}

In this particular use case the performance hit is probably substantial. There are ways to mitigate it (maybe having TaskPool maintain a free list, etc.), but I can't bring myself to put a lot of effort into optimizing a workaround for a compiler bug.
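
For reference, the free-list idea mentioned above could look roughly like the sketch below. This is an assumption-laden illustration, not code from parallelism.d: the ScratchFreeList type and its acquire()/release() methods are hypothetical names, and the sketch is not thread-safe, so a real TaskPool would need a lock around it or one list per worker thread.

```d
import core.memory : GC;

// Hypothetical helper (not part of parallelism.d): caches GC-allocated
// scratch buffers so repeated requests can skip the GC.
struct ScratchFreeList
{
    private void[][] cache;   // buffers handed back and available for reuse

    // Return a buffer of at least nBytes, reusing a cached one if possible.
    void[] acquire(size_t nBytes)
    {
        foreach (i, buf; cache)
        {
            if (buf.length >= nBytes)
            {
                // Remove slot i by overwriting it with the last entry.
                cache[i] = cache[$ - 1];
                cache = cache[0 .. $ - 1];
                // Let a later append reuse the slot instead of reallocating.
                cache.assumeSafeAppend();
                return buf;
            }
        }
        // Nothing suitable cached: fall back to the GC.
        return GC.malloc(nBytes)[0 .. nBytes];
    }

    // Hand a buffer back so a later acquire() can reuse it.
    void release(void[] buf)
    {
        cache ~= buf;
    }
}
```

Note that acquire() may return a buffer larger than requested, and that the first request of each size still goes to the GC; only repeat requests are served from the list.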
