I'm running my PETSc code on a cluster of quad-core Xeons connected by InfiniBand. I hadn't worried much about performance because everything seemed to be working well, but today I compared wall-clock time for the same problem on different combinations of CPUs.
I find that my PETSc code scales well until I start to use multiple cores per CPU. For example, the run time doesn't improve when going from 1 core/CPU to 4 cores/CPU. I find this very strange, especially since top and Ganglia show all 4 CPUs on each node running at 100% almost all of the time. I would have thought that if the CPUs were going all out, I would still be getting much better scaling. We are using mvapich-0.9.9 with InfiniBand, so I don't know whether this is a cluster/Xeon issue or something else. Does anybody have experience with this? Thanks, Randy M.
