I just built asperl 5.12.3 on win32, and on 4 CPUs, with the
number of threads set to 4 and the calculation set to
$a **= 1.3, I got almost a 4X speedup.  Fantastic!

--Chris

On Mon, Sep 12, 2011 at 10:32 AM, Chris Marshall <[email protected]> wrote:
> Just pushed the ':hireswallclock' version of t/pthread.t
> and t/pthread_auto.t to PDL git.  Thanks again, Dima.
>
> --Chris
>
> On Mon, Sep 12, 2011 at 7:50 AM, chm <[email protected]> wrote:
>> On 9/12/2011 5:00 AM, Dima Kogan wrote:
>>>>
>>>> On Sun, 11 Sep 2011 10:22:30 -0400
>>>> chm<[email protected]>  wrote:
>>>>
>>>> Has anyone seen performance benefit from the
>>>> new auto pthread capability?
>>>>
>>>> When I run the t/pthread_auto.t test on an
>>>> AMD Athlon(tm) X2 Dual Core machine I see no
>>>> win from pthreads.  It would seem that the
>>>> performance gain might depend on the complexity
>>>> of the calculation being threaded and on the
>>>> number of cores.
>>>>
>>>> Data points anyone?
>>>>
>>>> --Chris
>>>
>>> Hi.
>>>
>>> First off, the test was broken, but it seems you already fixed it (the
>>> unthreaded control case was actually set to 10-way threaded). I just ran
>>> some experiments to see just how beneficial extra threads are, and it is
>>> clear that the benchmarking reported by the test is misleading. It
>>> reports the wall-clock timing with a resolution of 1 second (way too
>>> coarse to be useful) and a user timing with a resolution of 0.01
>>> seconds. The user timing counts CPU time, so it's USELESS here. If 5
>>> cores each spend 1 second doing something, the user timing would be 5
>>> seconds, even though the whole point of the automatic threading is to
>>> reduce wall-clock timing at the cost of increased user timing.
>>
>> Thanks for investigating and clearing things up.
>>
>>> I increased the resolution of the wall-clock timing by replacing the
>>> 'use Benchmark' in the test header with
>>>
>>> use Benchmark ':hireswallclock';
>>> use Time::HiRes;
>>>
>>> If it's acceptable to require that Time::HiRes is available, we should
>>> make this change permanent, I think.
>>
>> According to the docs, just use the new Benchmark line.
>> You don't need to use Time::HiRes at all.  Benchmark
>> will quietly fall back to standard timing if Time::HiRes
>> is not available.  No dependencies required.
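
For reference, a minimal sketch of timing a block with the
':hireswallclock' import, using only core modules. The busy_work sub is
a hypothetical stand-in for the PDL operation being timed; it is not
taken from t/pthread_auto.t.

```perl
use strict;
use warnings;

# Per the Benchmark docs, this import switches wall-clock timing to
# Time::HiRes when it is available and otherwise falls back quietly.
use Benchmark ':hireswallclock';

# Hypothetical stand-in workload; the real test times PDL operations.
sub busy_work {
    my $sum = 0;
    $sum += sqrt($_) for 1 .. 200_000;
    return $sum;
}

my $t0 = Benchmark->new;
busy_work();
my $t1 = Benchmark->new;

my $elapsed = timediff($t1, $t0);

# ->real is wall-clock seconds; ->cpu_p is user+system CPU seconds
# of the parent process.
printf "wall-clock: %.4f s, cpu: %.2f s\n", $elapsed->real, $elapsed->cpu_p;
```

For a threaded run, the wall-clock figure is the one to compare; the
CPU figure is expected to grow with the number of threads.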
>>
>>> This gives us useful wall-clock numbers, so I ran some tests to see how
>>> adding threads affects the computation time. I did this with the stock
>>> computation in the test ( $a += 1 ) and a more complicated computation
>>> to try to reduce the overhead costs ( $a = random(2000000); $a **= 1.3 ).
>>> The timings were done on a recent 8-core Intel machine running a recent
>>> Debian/unstable install. Wall-clock timings, in seconds:
>>>
>>>
>>> | set_autopthread_targ | += 1 (500 times) | **= 1.3 (10 times) |
>>> |----------------------+------------------+--------------------|
>>> |                    0 |             1.90 |               2.15 |
>>> |                    1 |             1.90 |               2.15 |
>>> |                    2 |             1.17 |               1.10 |
>>> |                    3 |             1.15 |               1.10 |
>>> |                    4 |             0.91 |               0.56 |
>>> |                    5 |             0.89 |               0.45 |
>>> |                    6 |             0.90 |               0.45 |
>>> |                    7 |             0.90 |               0.46 |
>>> |                    8 |             0.80 |               0.29 |
>>> |                    9 |             0.80 |               0.29 |
>>> |                   10 |             0.93 |               0.39 |
>>>
>>> We can clearly see that extra threads make things go quicker. We can
>>> clearly see that the heavier computation benefits more from extra
>>> threads (lower relative overhead costs to maintain the threads). There's
>>> an interesting discrete nature to the improvement: adding a 4th thread
>>> makes a huge difference, while adding a 3rd makes none at all. This may
>>> be due to the way the auto-threading is implemented. We can also see
>>> that when we have more threads than cores, the extra threads are a
>>> burden, not an improvement.
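
The speedups implied by the table above can be worked out with a few
lines of plain Perl. This is only a sketch: the timings are copied from
the "**= 1.3" column, and the speedup helper and the efficiency figure
(speedup divided by the thread target) are illustrative additions, not
part of the test.

```perl
use strict;
use warnings;

# Wall-clock seconds for the "**= 1.3 (10 times)" column, copied from the
# table above and keyed by the set_autopthread_targ value.
my %wall = (
    0 => 2.15, 1 => 2.15, 2 => 1.10, 3 => 1.10,
    4 => 0.56, 5 => 0.45, 6 => 0.45, 7 => 0.46,
    8 => 0.29, 9 => 0.29, 10 => 0.39,
);

# Hypothetical helper: speedup relative to the unthreaded run (target 0).
sub speedup {
    my ($n) = @_;
    return $wall{0} / $wall{$n};
}

for my $n (sort { $a <=> $b } grep { $_ > 0 } keys %wall) {
    my $s = speedup($n);
    # Efficiency: fraction of ideal linear scaling actually achieved.
    printf "%2d threads: speedup %.2fx, efficiency %3.0f%%\n",
        $n, $s, 100 * $s / $n;
}
```

With these numbers, 8 threads come out at roughly 7.4x for the heavier
computation, while 10 threads on the 8-core machine drop back to about
5.5x.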
>>
>> Mystery solved!  I guess getting pthreads working for win32 will
>> be worth it after all.
>>
>> Thanks,
>> Chris
>>
>

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
