Barry Smith <[email protected]> writes: >> On May 29, 2015, at 2:29 PM, Jed Brown <[email protected]> wrote: >> >> Barry Smith <[email protected]> writes: >> >>> I cannot explain why the load balance would be 1.0 unless, by >>> unlikely coincidence on the 248 different calls to the function >>> different processes are the ones waiting so that the sum of the >>> waits on different processes matches over the 248 calls. Possible >>> but >> >> Uh, it's the same reason VecNorm often shows significant load imbalance. > > Uh, I don't understand. It shows NO imbalance but huge > times. Normally I would expect a large imbalance and huge times. So > I cannot explain why it has no imbalance. 1.0 means no imbalance.
Sorry, I mixed two comments. There are two non-scalable operations, determining ownership for outgoing entries (the loop I showed) and the huge MPI_Allreduce. Our timers can never observe load imbalance in the ownership determination because all that work is done after the timer has started and the timer can't end until after the MPI_Allreduce. The MPI_Allreduce is really expensive because it involves 1-2 MB from each of 128k cores. (MPI_Reduce_scatter_block is much better.) If incoming load imbalance is small relative to the ownership determination plus MPI_Allreduce, then we see 1.0. Putting a barrier before (sort of) guarantees that, but if it was already the case, the barrier won't change anything.
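
To make the Allreduce-vs-Reduce_scatter_block point concrete, here is a minimal MPI sketch (not the actual PETSc stash code; the variable names and toy counts are made up): each rank contributes an array of per-destination counts of length P, but only needs the single sum corresponding to itself, so MPI_Reduce_scatter_block returns O(1) data per rank where MPI_Allreduce returns the full O(P) array.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* sendcounts[i] = number of entries this rank will send to rank i
     (dummy values for illustration) */
  int *sendcounts = (int*)malloc(size * sizeof(int));
  for (int i = 0; i < size; i++) sendcounts[i] = (rank + i) % 3;

  /* Non-scalable version: every rank reduces the full length-P array,
     then reads only its own entry. */
  int *work = (int*)malloc(size * sizeof(int));
  MPI_Allreduce(sendcounts, work, size, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  int recv_allreduce = work[rank];

  /* Scalable version: each rank receives only the one sum it needs. */
  int recvcount;
  MPI_Reduce_scatter_block(sendcounts, &recvcount, 1, MPI_INT, MPI_SUM,
                           MPI_COMM_WORLD);

  if (recvcount != recv_allreduce) printf("[%d] mismatch!\n", rank);
  else printf("[%d] will receive %d entries\n", rank, recvcount);

  free(work);
  free(sendcounts);
  MPI_Finalize();
  return 0;
}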
