On Fri, Jan 20, 2012 at 15:27, Dominik Szczerba <dominik at itis.ethz.ch> wrote:
> I am running some performance tests on a distributed cluster, each node
> with 16 cores (Cray). I am very surprised to find that my benchmark jobs
> are about 3x slower when running on N nodes using all 16 cores than when
> running on N*16 nodes using only one core.

Yes, this is normal. Memory bandwidth is the overwhelming bottleneck for
most sparse linear algebra. One core can almost saturate the bandwidth of a
socket, so you see little benefit from the extra cores (the STREAM-style
sketch at the end of this message shows this directly). Pay attention to
memory bandwidth when you buy computers, and try to make your algorithms
use a lot of flops per memory access if you want to utilize the
floating-point hardware you have lying around.

> I find this with 2 independent PETSc builds and they both exhibit the
> same behavior: my own GNU build and the system module PETSc, both 3.2.
> So far I have been unable to build my own PETSc version with the Cray
> compilers to compare.
>
> The scheme is relatively complex: a shell matrix with block
> preconditioners, transient non-linear problem. I am using BoomerAMG from
> hypre.
>
> What do you think this unexpected performance comes from? Is it possible
> that the node interconnect is faster than the shared memory bus on the
> node? I was expecting the exact opposite.
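
For a quick check on one node, independent of PETSc, you can run a
STREAM-style triad with increasing thread counts. The sketch below is only
illustrative (plain C with OpenMP; the array size, trial count, and compile
line are my assumptions, not anything taken from your setup): if the
reported bandwidth stops improving after a few threads, adding cores will
not help a bandwidth-bound sparse solver either.

/* triad.c: STREAM-style triad to illustrate memory-bandwidth saturation.
 * Illustrative sketch only.  Compile e.g. under the GNU programming
 * environment:  cc -O2 -fopenmp triad.c -o triad
 * Run with OMP_NUM_THREADS=1,2,4,8,16 and compare the reported rates. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000              /* three double arrays, ~480 MB total */
#define NTRIALS 10

int main(void)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double scalar = 3.0, t0, t1;
  int i, trial;

  /* First-touch initialization so pages land near the threads using them */
  #pragma omp parallel for
  for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

  t0 = omp_get_wtime();
  for (trial = 0; trial < NTRIALS; trial++) {
    #pragma omp parallel for
    for (i = 0; i < N; i++) c[i] = a[i] + scalar * b[i];  /* 2 flops, 24 bytes */
  }
  t1 = omp_get_wtime();

  /* three arrays of doubles streamed per trial */
  printf("threads %d  bandwidth %.1f GB/s\n", omp_get_max_threads(),
         NTRIALS * 3.0 * N * sizeof(double) / (t1 - t0) / 1e9);
  free(a); free(b); free(c);
  return 0;
}

On most multicore sockets the measured rate flattens well before all cores
are in use, which is the same effect you are seeing in the solver. If your
PETSc installation ships its streams benchmark, it gives the same
information with MPI ranks instead of threads, which matches how your jobs
are actually launched.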
