On 10/05/2014 07:27 AM, flamencofantasy wrote:

> I am summing up the first 1 billion integers in parallel and in a single
> thread and I'm observing some curious results;
>
> parallel sum : 499999999500000000, elapsed 102833 ms
> single thread sum : 499999999500000000, elapsed 1667 ms
>
> The parallel version is 60+ times slower

Reducing the number of threads is key. However, unlike what others said, parallel() does not use that many threads. By default, a TaskPool object is constructed with 'totalCPUs - 1' worker threads, and all of parallel()'s iterations are executed on those few threads.
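
To see the pool size on a particular machine, a minimal sketch like the following should do. It only prints values that std.parallelism already exposes (totalCPUs, defaultPoolThreads and taskPool.size); the program itself is just an illustration, not part of the original code.

```d
import std.parallelism : defaultPoolThreads, taskPool, totalCPUs;
import std.stdio : writeln;

void main()
{
    // Number of logical CPUs reported by the system.
    writeln("totalCPUs:          ", totalCPUs);

    // Number of worker threads the default pool is created with,
    // unless the program has changed this property.
    writeln("defaultPoolThreads: ", defaultPoolThreads);

    // Worker threads in the pool that parallel() uses by default.
    writeln("taskPool.size:      ", taskPool.size);
}
```

The thread that calls parallel() also takes part in the work, which is why the pool itself holds one thread fewer than the CPU count.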

The main problem here is the use of atomicOp, which makes every thread synchronize on the same variable each time it is called, so the additions effectively run one at a time.
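
For reference, the slow pattern presumably looks something like the sketch below. This is an assumption on my part: the original loop is not quoted here, contendedSum is a hypothetical helper name, and the element count in main is kept much smaller than a billion so that the program finishes quickly.

```d
import core.atomic : atomicLoad, atomicOp;
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

// Hypothetical helper: sums 0 .. n with one atomic update per element.
// Every iteration touches the same shared variable, so the worker
// threads spend their time contending for it instead of adding.
ulong contendedSum(ulong n)
{
    shared ulong sum = 0;

    foreach (i; parallel(iota(0UL, n)))
    {
        atomicOp!"+="(sum, i);
    }

    return atomicLoad(sum);
}

void main()
{
    // Assumed, smaller input; the original used 1 billion.
    writeln(contendedSum(1_000_000));
}
```

The result is correct; it is only the per-element synchronization that makes it slow.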

Something like the following takes advantage of parallelism and reduces the execution time by half on my machine (4 logical cores: 2 actual ones, hyperthreaded). Each task sums its own range locally and calls atomicOp only once.

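    import core.atomic : atomicOp;
    import std.parallelism : parallel;
    import std.range : iota;

    // Sums the half-open range [beg, end) without any synchronization.
    ulong adder(ulong beg, ulong end)
    {
        ulong localSum = 0;

        foreach (i; beg .. end) {
            localSum += i;
        }

        return localSum;
    }

    enum totalTasks = 10;

    // 'iter' (the element count) and 'sum' (the shared total) are the
    // variables from the original program.
    foreach(i; parallel(iota(0, totalTasks)))
    {
        // Deriving 'end' from i + 1 keeps the chunks contiguous even
        // when iter is not a multiple of totalTasks.
        ulong beg = i * iter / totalTasks;
        ulong end = (i + 1) * iter / totalTasks;

        // One atomic update per task instead of one per element.
        atomicOp!"+="(sum, adder(beg, end));
    }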
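
As an alternative that avoids atomicOp altogether, std.parallelism also provides taskPool.reduce, which splits the range into work units and combines the partial results itself. The following is only a sketch of that different technique, not what the code above does; parallelSum is a hypothetical helper name.

```d
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

// Hypothetical helper: sums 0 .. n using the default task pool.
// The explicit 0UL seed makes the accumulator a ulong.
ulong parallelSum(ulong n)
{
    return taskPool.reduce!"a + b"(0UL, iota(0UL, n));
}

void main()
{
    // Same input as in the original question: the first 1 billion integers.
    writeln(parallelSum(1_000_000_000));
}
```

For 1 billion elements this should print the same 499999999500000000 as in the original question. I have not timed it against the chunked version above.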

Ali
