I am running some performance tests on a distributed Cray cluster with 16 cores per node. I am very surprised to find that my benchmark jobs are about 3x slower when running on N nodes using all 16 cores per node than when running on N*16 nodes using only one core per node (the two launch configurations are sketched below). I see this with two independent PETSc builds, and both exhibit the same behavior: my own GNU build and the system PETSc module, both version 3.2. So far I have been unable to build my own PETSc with the Cray compilers to compare.
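For illustration, the two configurations are launched roughly like this (the executable name and total rank count are placeholders; aprun's -n gives the total number of ranks and -N the ranks per node):

  # case 1: N nodes, all 16 cores used per node (the slow case)
  aprun -n 256 -N 16 ./mysolver

  # case 2: N*16 nodes, only 1 core used per node (the fast case)
  aprun -n 256 -N 1 ./mysolver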
The scheme is relatively complex: a transient nonlinear problem with a shell matrix and block preconditioners, using BoomerAMG from hypre. Where do you think this unexpected performance gap may come from? Is it possible that the node interconnect is faster than the shared-memory bus within a node? I was expecting the exact opposite. Thanks for any opinions.

Dominik
