On Saturday 02 April 2005 12:01, Greg Edwards wrote:

> Can you split an L-L test into parallel threads over say 8 cpu's, and get
> the answer ~ 8 times faster ?
> I had thought that the L-L algorithm was strictly sequential.

The _algorithm_ is strictly sequential but the implementation needn't be - 
because, in R(n+1) = R(n)^2-2, the R's are multiprecision and therefore not 
single operations.

The point is that the squaring operation needs to be broken into pieces which 
can be, to some extent, executed in parallel. You won't get the full 8x but 
you can get a very useful speed increase over a linear uniprocessor 
implementation.

It's not that easy to illustrate the way the FFT works but take the analogy 
of executing a double length multiplication by conventional long 
multiplication algorithm:

(Ax+B)(Cx+D) = ACx^2 + (BC+AD)x + BD

Now the x, x^2 are just multiples of the word length so cost nothing.

A uniprocessor implementation has to execute 4 single-length multiplications 
in series but a 4 processor system could do the 4 multiplications in parallel 
leaving only the additions to run serially. (Don't forget that the high half 
of BD would overlap the low half of (BC+AD)x, similarly the high half of 
(BC+AD)x would overlap the low half of ACx^2 in the same way, so there's 
actually 4 additions to do as well as 4 multiplies). The additions have to be 
done serially as carries across word boundaries get involved.

Assume for the sake of argument that a single-length multiply takes the same 
length as a double-length addition. The uniprocessor implementation would 
take 8T whereas the 4 processor implementation would take 5T.

If "single length" is already a multiple of the word length, the saving is 
greater as (parellizable) multiplication will be a much lengthier operation 
than (necessarily serial) addition. It doesn't take too much lengthening of 
the operands for the saving to approach (but never quite reach) 75%, even 
when the extra task scheduler overheads involved in managing parallel threads 
are factored in.

Without trying to explain FFT in detail, the individual "butterfly" 
operations can be run in parallel with only minimal interference from 
"carries" resulting from other threads.
>
> To answer Max: yes I'm sure for throughput I should just run 8 copies of
> single-thread L-L, and I hope to be able to try that.

Yes. This will always give the greatest throughput - assuming you have 
adequate memory; if anything has to stray onto a swap file then the 
performance will take a _huge_ hit! Shouldn't be a problem with LL testing, 
though; multiprocessor systems are almost certainly going to feature 
gigabytes of memory rather than the 256 MB or so you'd need to run 8 parallel 
independent LLs on very large exponents.

Another practical problem arises from parallel threads running on seperate 
processors potentially trashing each others' cache contents, which wastes 
processor cycles whilst cache lines are refreshed. With care in 
implementation this can be minimized though probably at the expense of 
stricter synchronization between threads and/or leaving more serial 
operations to be done. The balance is almost certainly going to be critically 
dependent on the implementation being tuned to the exact cache/memory 
characteristics of the hardware platform, as well as the efficiency of the 
task scheduler provided by the operating system.

Regards
Brian Beesley
_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime

Reply via email to