That load imbalance often comes from whatever came *before* the reduction.
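To make that concrete: the time the log attributes to VecNorm/VecMDot includes the time each process spends waiting for the slowest rank to reach the synchronization point, so a ~5x max/min ratio there usually means the work before the reduction was imbalanced, not the reduction itself. A minimal MPI sketch of the kind of check you could run (this is just an illustration, not PETSc's own instrumentation): time a barrier placed right before the reduction separately from the reduction itself; if the barrier accounts for most of the time, the imbalance was inherited from earlier work.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  double local = 1.0, global, t0, t_wait, t_reduce;
  int    rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* ... the (possibly imbalanced) local work of one iteration would go here ... */

  t0 = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);   /* absorbs the wait caused by earlier imbalance */
  t_wait = MPI_Wtime() - t0;

  t0 = MPI_Wtime();
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  t_reduce = MPI_Wtime() - t0;

  printf("[%d] wait before reduction: %.3e s, reduction itself: %.3e s\n",
         rank, t_wait, t_reduce);

  MPI_Finalize();
  return 0;
}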
On May 26, 2012 5:25 PM, "Mark F. Adams" <mark.adams at columbia.edu> wrote:

> Just a few comments on the perf data. I see a lot (~5x) of load imbalance on
> reduction stuff like VecMDot and VecNorm (Time Ratio, max/min), but even a
> lot of imbalance on simple non-collective vector ops like VecAXPY (~3x).
> So if the work is well load balanced, then I would suspect it's a NUMA issue.
>
> Mark
>
> On May 26, 2012, at 6:13 PM, Aron Roland wrote:
>
> Dear All,
>
> I have some questions on a recent implementation of PETSc for solving a
> large linear system from a 4d problem on hybrid unstructured meshes.
>
> The point is that we have implemented all the mappings and the solution is
> fine, and so is the number of iterations. The results are robust with respect
> to the number of CPUs used, but we have a scaling issue.
>
> The system is an Intel cluster of the latest generation on InfiniBand.
>
> We have attached the summary ... with hopefully a lot of information.
>
> Any comments, suggestions, and ideas are very welcome.
>
> We have been reading the threads that deal with multi-core and the
> bus-limitation stuff, so we are aware of this.
>
> I am now thinking about an OpenMP/MPI hybrid approach, but I am not quite
> happy with the bus-limitation explanation; most systems are multicore.
>
> I hope the limitation is not the sparse matrix mapping that we are using
> ...
>
> Thanks in advance ...
>
> Cheers
>
> Aron
>
> <benchmark.txt>
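On the VecAXPY/NUMA question: VecAXPY streams two or three vectors through memory and does very little arithmetic per byte, so once a few cores per socket are active it is limited by memory bandwidth and the per-node rate stops improving even though more cores are busy. If the MPI ranks are not pinned to cores, or the vectors' pages land on the wrong socket at first touch, you can also get the kind of 3x spread between ranks that shows up in the log. A rough per-rank bandwidth probe you could run yourself (illustrative only; the array size and the triad kernel are arbitrary choices, not anything PETSc ships) with 1, 2, 4, ... ranks per node will show where the node's bandwidth saturates:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)  /* 16M doubles per array, ~128 MB per rank */

int main(int argc, char **argv)
{
  double *a, *b, *c, t, rate;
  int     rank;
  long    i;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  a = (double *)malloc(N * sizeof(double));
  b = (double *)malloc(N * sizeof(double));
  c = (double *)malloc(N * sizeof(double));

  /* First touch: the rank that initializes the pages owns them (NUMA). */
  for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  MPI_Barrier(MPI_COMM_WORLD);
  t = MPI_Wtime();
  for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];  /* same access pattern as VecAXPY */
  t = MPI_Wtime() - t;

  rate = 3.0 * N * sizeof(double) / t / 1.0e9;  /* GB moved per second by this rank */
  printf("[%d] %.2f GB/s (a[0]=%g)\n", rank, rate, a[0]);

  free(a); free(b); free(c);
  MPI_Finalize();
  return 0;
}

If the aggregate GB/s stops growing well before all cores on a node are in use, the VecAXPY behavior is explained by memory bandwidth rather than by the sparse matrix mapping.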
