Adapting programs to make optimal use of modern multicore chips is non-trivial and depends very much on the algorithms employed; there is no magic bullet. First one needs to understand Amdahl's law. Assume we have a chip with four cores, such as the widely used Intel i7, and the program consists of two parts that would each take one minute on a single-core machine, i.e. the total time taken is 2 minutes. If we succeed in making part 1 fully parallel and part 2 is not parallel, then the time required will be 0.25 + 1.0 minutes = 1.25 minutes. However many cores we have, we will never reduce the total time to less than 1 minute! So it is important to make ALL rate-determining stages parallel.
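To make the arithmetic concrete, here is a minimal sketch in C (not from any CCP4 or SHELX program; the function name amdahl_time and the one-minute timings are just the example figures from the paragraph above) that tabulates the total time and speed-up for the two-part program as the core count grows:

    /* Amdahl's law for the two-part example above:
       part 1 (1 min) fully parallel over n cores, part 2 (1 min) serial. */
    #include <stdio.h>

    /* t_par, t_ser: single-core times of the parallel and serial parts */
    static double amdahl_time(double t_par, double t_ser, int n_cores)
    {
        return t_par / n_cores + t_ser;
    }

    int main(void)
    {
        for (int n = 1; n <= 32; n *= 2)
            printf("%2d cores: %.3f min (speed-up %.2fx)\n",
                   n, amdahl_time(1.0, 1.0, n),
                   2.0 / amdahl_time(1.0, 1.0, n));
        return 0;   /* the speed-up approaches 2x but never reaches it */
    }

However many cores are used, the serial minute remains, so the printed speed-up never exceeds 2x.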

However, in practice the timing is only approximately what Amdahl's law predicts, because:

1) The i7 uses hyperthreading, so it can run two threads on each core. However, the threads share the core's number-crunching unit, so this only helps if the threads frequently have to wait, e.g. to fetch data from the hard disk. For efficient number-crunching code, hyperthreading does not help much.

2) The i7 actually increases its clock frequency if it is running only one thread, because the critical factor is the amount of heat generated. So the speed-up when using multiple cores is smaller than one would expect.

3) In special cases (which I have never managed to achieve) the speed-up becomes 'super-linear'. This is because each core has its own cache memory, so by dividing up a large matrix (too big for one cache) so that it is all held in cache, the speed-up can be more than expected for the number of cores, because cache access is much faster than RAM access.

The key to writing efficient parallel code is to reduce the communication between threads to an absolute minimum. By doing this in the program shelxd (heavy-atom location for experimental phasing) I was able to achieve a 27-fold speed-up on a 32-core machine. However, for my other OpenMP programs the gain was much more modest. For some of them there is little advantage in using more than about 4 cores, primarily because of Amdahl's law.
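As an illustration of that minimal-communication idea (this is only a hedged sketch, not code from shelxd; the loop body is a placeholder for real per-trial work), an OpenMP loop in C where each thread keeps its own partial result and the threads exchange data only once, through a reduction at the end of the loop, might look like this:

    /* Sketch: threads work on private data; the only shared result is the
       running maximum, combined by the OpenMP reduction when the loop ends. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 10000000;
        double best = -1.0;

        #pragma omp parallel for reduction(max:best) schedule(static)
        for (int i = 0; i < n; i++) {
            double score = (double)(i % 12345) / 12345.0;  /* stand-in for real work */
            if (score > best)
                best = score;
        }

        printf("best score %.6f using up to %d threads\n",
               best, omp_get_max_threads());
        return 0;
    }

Compiled with e.g. gcc -fopenmp, every thread scores its own trials independently and no thread ever waits on another inside the loop, which is the kind of pattern that scales well; frequent updates to shared data would not.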

George


On 12/18/2013 07:50 PM, Marcin Wojdyr wrote:
> On Tue, Dec 17, 2013 at 03:32:52PM +0000, Adam Ralph wrote:
>> Dear Chang,
>>
>>      Some CCP4 progs can be used with a multi-core machine,
>> using OpenMP threads (including refmac it would appear). You will
>
> I think only phaser and aimless.
> Of course using 4 cores doesn't mean running 4 times faster
> (it's more like ~2x faster for Phaser).
>
>> need to compile the code from source rather than taking the binary
>> versions
>
> These programs are already compiled with OpenMP in CCP4.
>
>>      Even if the CCP4 apps are not parallel themselves, they can access
>> a parallel version of libraries e.g. FFTW, LAPACK. Again you will
>> probably need to compile CCP4 from source and link with the correct
>> libraries.
>
> It's possible, but I doubt it will make a noticeable difference.
> Refmac runs don't spend much time in LAPACK. Probably the same with
> FFTW, which is used by programs that use Kevin's clipper.
>
> One thing that can make a big difference is the env. variable
> GFORTRAN_UNBUFFERED_ALL. It shouldn't be set. If it is set, some
> programs run a few times slower.
>
> Marcin


--
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,
D37077 Goettingen, Germany
Tel. +49-551-39-33021 or -33068
Fax. +49-551-39-22582
