Then what's the point of having 4 and 8 cores per CPU for parallel computations? I mean, I think I've done all I can to make my code as efficient as possible.
I'm not quite sure I understand your comment about using blocks or unassembled structures.

Randy


Matthew Knepley wrote:
> On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie <rlmackie862 at gmail.com> wrote:
>> I'm running my PETSc code on a cluster of quad-core Xeons connected
>> by InfiniBand. I hadn't much worried about the performance, because
>> everything seemed to be working quite well, but today I was actually
>> comparing performance (wall-clock time) for the same problem on
>> different combinations of CPUs.
>>
>> I find that my PETSc code is quite scalable until I start to use
>> multiple cores per CPU.
>>
>> For example, the run time doesn't improve by going from 1 core/cpu
>> to 4 cores/cpu, and I find this to be very strange, especially since,
>> looking at top or Ganglia, all 4 cpus on each node are running at 100%
>> almost all of the time. I would have thought that if the cpus were
>> going all out, I would still be getting much more scalable results.
>
> Those are really coarse measures. There is absolutely no way that all cores
> are doing useful work 100% of the time. It's easy to show by hand: take the
> peak flop rate, and this gives you the bandwidth needed to sustain that
> computation (if everything is perfect, like axpy). You will find that the
> chip bandwidth is far below this. A nice analysis is in
>
>   http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf
>
>> We are using mvapich-0.9.9 with InfiniBand. So I don't know if
>> this is a cluster/Xeon issue, or something else.
>
> This is actually mathematics! How satisfying. The only way to improve
> this is to change the data structure (e.g. use blocks) or change the
> algorithm (e.g. use spectral elements and unassembled structures).
>
>   Matt
>
>> Anybody with experience on this?
>>
>> Thanks, Randy M.
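To make the bandwidth argument concrete, here is a minimal back-of-the-envelope sketch in C. The clock rate, flops per cycle, and per-socket memory bandwidth below are illustrative assumptions for a quad-core Xeon of that era, not measured values; only the bytes-per-flop accounting for axpy comes from the argument itself.

  /* Rough estimate: bandwidth needed to keep an axpy at peak flop rate.
     All hardware numbers are assumptions for illustration only. */
  #include <stdio.h>

  int main(void)
  {
      /* y[i] = a*x[i] + y[i]: 2 flops per element,
         read x (8 B) + read y (8 B) + write y (8 B) = 24 bytes */
      double flops_per_elem  = 2.0;
      double bytes_per_elem  = 24.0;

      double peak_gflops     = 4 * 2.5 * 4; /* assumed: 4 cores * 2.5 GHz * 4 flops/cycle */
      double needed_gbytes_s = peak_gflops * bytes_per_elem / flops_per_elem;
      double socket_gbytes_s = 10.0;        /* assumed memory bandwidth per socket */

      printf("bandwidth needed to feed peak: %.0f GB/s\n", needed_gbytes_s);
      printf("bandwidth available:           %.0f GB/s\n", socket_gbytes_s);
      printf("achievable fraction of peak:   %.1f %%\n",
             100.0 * socket_gbytes_s / needed_gbytes_s);
      return 0;
  }

With these assumed numbers, a bandwidth-bound kernel can sustain only a few percent of peak, and adding more cores on the same socket does not add memory bandwidth, which is why top and Ganglia can show 100% busy cores while the wall-clock time barely improves.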
