Hi Tony,
Several folk have gently 'warned me off' this idea (to parallel-ise the
FFT), and I can see now that they are probably right, it's insanely
difficult, especially with Georges code being written in such highly
optimized assembler.
But, I'm sure this is a problem that should be tackled sometime, since
moores-law of increasing performance is now being realised not by ever
higher clock rates, but instead by replication of processing cores. So,
a fast PC in 2008/9 might still clock at 3Ghz or so, but could have
eight CPU cores, each executing instructions at 3Ghz.
In this multi-core scenario, one could use the available compute
throughput by running 8 concurrent instances of Prime95 (which is
beneficial by testing more potential primes at a time) or by running one
Prime95 with the FFT code split over 8 cores (which would be beneficial
by testing one prime in a shorter timeframe)
It seems to me that as the prime numbers being searched become ever more
gigantic, and the time to run a single Lucas-Lehmer test stretches into
months, it would be really beneficial to be able to devote several CPU
cores to the task of testing one number. But I can see this is
technically quite a challenge...
By the way, congrats to the GIMPs team for possible discovery of the
44th Mersenne!
Peter Moreton
Getronics UK Ltd
As Matthias and George said, "'decompose' the Prime95 Lucas-Lehmer code
over 4 CPU cores" will certainly be very difficult.
I've already suggested George to think about that because more and more
processors have several cores and because soon it will take many months
on a fast machine to crunch the Mersenne numbers with big exponents.
However, "'decompose' the C Lucas-Lehmer code over N CPU cores" has
already being done by Ernst Mayer (MLucas, with MPI/OpenMP) and
Guillermo Valor (GLucas, with POSIX threads and MPI or OpenMP, don't
remember which one).
I had helped Guillermo to fix some small problems with threads, and I've
done some measurements of the scalability of GLucas on ia64 and PPC NUMA
architecture.
If you look at thread:
http://www.mersenneforum.org/showthread.php?t=5427
and at figure in post 11, you'll see that //ing a FFT is not 100%
scalable, even on very scalable IBM PPC machines, because some work has
to be done by a single thread for each iteration.
So, at least, if you have time to spend on this, I guess you should talk
with Ernst, Guillermo and George.
I would be very pleased to use such a //ed Prime95 ! even more if
scalability is quite good (more than 95% on 4 CPUs).
Bon courage !
Tony
_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime