Adapting programs to make optimal use of modern multicore chips is
non-trivial and depends very much on the algorithms employed;
there is no magic bullet. First one needs to understand Amdahl's law.
Assume we have a chip with four cores, such as the widely used
Intel i7, and the program consists of two parts that would each take one
minute on a single-core machine, i.e. the total time taken is 2
minutes. If we succeed in making part 1 fully parallel while part 2 stays
serial, then the time required on four cores will be 0.25 + 1.0 minutes
= 1.25 minutes. However many cores we have, we will never reduce the
total time to less than 1 minute! So it is important to make ALL
rate-determining stages parallel.
However, this is only approximately what happens in practice, because:
1) The i7 uses hyperthreading, so it can run 2 threads on each core.
However, the two threads share the core's number-crunching units, so this
only helps if the threads frequently have to wait, e.g. to get data from
the hard disk. For efficient number-crunching code, hyperthreading
does not help much.
2) The i7 actually increases its clock frequency when it is running only
one thread ('Turbo Boost'), because the critical factor is the amount of
heat generated. So the speed-up when using multiple cores is smaller than
one would expect.
3) In special cases (which I have never managed to achieve) the speed-up
becomes 'super-linear', i.e. greater than the number of cores. This is
because each core has its own cache memory, so by dividing up a large
matrix (too big for one cache) so that each piece is held entirely in a
core's cache, the speed-up can be more than expected for the number of
cores, because cache access is much faster than RAM access.
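The matrix-splitting trick described in 3) is usually written as blocked
(tiled) loops; a hypothetical C sketch, where BLOCK is a tile size one
would tune so that the working tiles fit in a core's cache:

```c
#include <stddef.h>

#define BLOCK 64  /* tile size; would be tuned so tiles fit in cache */

/* Naive matrix multiply: C = A * B, all n x n, row-major. Strides over
   whole rows and columns, so large matrices keep falling out of cache. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

/* Blocked multiply: same arithmetic, but each BLOCK x BLOCK tile is
   reused while it is still hot in cache, so far fewer trips to RAM. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n*n; i++) C[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        double a = A[i*n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```

When the tiles for each thread fit in that core's private cache, the
combined cache capacity of all cores is what makes a super-linear
speed-up possible at all.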
The key to writing efficient parallel code is to reduce the
communication between threads to an absolute minimum. By doing this in
the program shelxd (heavy-atom location for experimental phasing) I was
able to achieve a 27-fold speed-up on a 32-core machine.
However, for my other OpenMP programs the gain was much more modest. For
some of them there is little advantage in using more
than about 4 cores, primarily because of Amdahl's law.
George
On 12/18/2013 07:50 PM, Marcin Wojdyr wrote:
> On Tue, Dec 17, 2013 at 03:32:52PM +0000, Adam Ralph wrote:
>> Dear Chang,
>> Some CCP4 progs can be used with a multi-core machine,
>> using OpenMP threads (including refmac it would appear). You will
> I think only phaser and aimless.
> Of course using 4 cores doesn't mean running 4 times faster
> (it's more like ~2x faster for Phaser).
>> need to compile the code from source rather than taking the binary
>> versions
> These programs are already compiled with OpenMP in CCP4.
>> Even if the CCP4 apps are not parallel themselves, they can access
>> a parallel version of libraries e.g. FFTW, LAPACK. Again you will
>> probably need to compile CCP4 from source and link with the correct
>> libraries.
> It's possible, but I doubt it will make a noticeable difference.
> Refmac runs don't spend much time in LAPACK. Probably the same with
> FFTW, which is used by programs that use Kevin's clipper.
> One thing that can make a big difference is the env. variable
> GFORTRAN_UNBUFFERED_ALL. It shouldn't be set. If it is set, some
> programs run a few times slower.
> Marcin
--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,
D37077 Goettingen, Germany
Tel. +49-551-39-33021 or -33068
Fax. +49-551-39-22582