On Saturday 02 April 2005 12:01, Greg Edwards wrote: > Can you split an L-L test into parallel threads over say 8 cpu's, and get > the answer ~ 8 times faster ? > I had thought that the L-L algorithm was strictly sequential.
The _algorithm_ is strictly sequential but the implementation needn't be - because, in R(n+1) = R(n)^2-2, the R's are multiprecision and therefore not single operations. The point is that the squaring operation needs to be broken into pieces which can be, to some extent, executed in parallel. You won't get the full 8x but you can get a very useful speed increase over a linear uniprocessor implementation. It's not that easy to illustrate the way the FFT works but take the analogy of executing a double length multiplication by conventional long multiplication algorithm: (Ax+B)(Cx+D) = ACx^2 + (BC+AD)x + BD Now the x, x^2 are just multiples of the word length so cost nothing. A uniprocessor implementation has to execute 4 single-length multiplications in series but a 4 processor system could do the 4 multiplications in parallel leaving only the additions to run serially. (Don't forget that the high half of BD would overlap the low half of (BC+AD)x, similarly the high half of (BC+AD)x would overlap the low half of ACx^2 in the same way, so there's actually 4 additions to do as well as 4 multiplies). The additions have to be done serially as carries across word boundaries get involved. Assume for the sake of argument that a single-length multiply takes the same length as a double-length addition. The uniprocessor implementation would take 8T whereas the 4 processor implementation would take 5T. If "single length" is already a multiple of the word length, the saving is greater as (parellizable) multiplication will be a much lengthier operation than (necessarily serial) addition. It doesn't take too much lengthening of the operands for the saving to approach (but never quite reach) 75%, even when the extra task scheduler overheads involved in managing parallel threads are factored in. Without trying to explain FFT in detail, the individual "butterfly" operations can be run in parallel with only minimal interference from "carries" resulting from other threads. > > To answer Max: yes I'm sure for throughput I should just run 8 copies of > single-thread L-L, and I hope to be able to try that. Yes. This will always give the greatest throughput - assuming you have adequate memory; if anything has to stray onto a swap file then the performance will take a _huge_ hit! Shouldn't be a problem with LL testing, though; multiprocessor systems are almost certainly going to feature gigabytes of memory rather than the 256 MB or so you'd need to run 8 parallel independent LLs on very large exponents. Another practical problem arises from parallel threads running on seperate processors potentially trashing each others' cache contents, which wastes processor cycles whilst cache lines are refreshed. With care in implementation this can be minimized though probably at the expense of stricter synchronization between threads and/or leaving more serial operations to be done. The balance is almost certainly going to be critically dependent on the implementation being tuned to the exact cache/memory characteristics of the hardware platform, as well as the efficiency of the task scheduler provided by the operating system. Regards Brian Beesley _______________________________________________ Prime mailing list [email protected] http://hogranch.com/mailman/listinfo/prime
