Although splitting into sqrt(P) teams of sqrt(P) processes is common for 2D SpMV (or a 3D analogue for LU/SparseLU), splitting by coarse-grained parallelism (e.g., multiple concurrent solves on multiple RHS) is also possible. There the teams can be smaller (under P/10 processes, say), but you may still be sensitive to the O(P^2) worst case. For multigrid with agglomeration (or any algorithm where the number of active processes falls exponentially), you may have some splits where nearly all the processes have the same color.
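To make that concrete, the split cascade I have in mind looks roughly like the sketch below (illustrative only, not the actual hpgmg-fv code). At the coarsest levels nearly every rank passes the same (idle) color, which is exactly the all-same-color case:

    /* Illustrative sketch of an agglomeration-style split cascade.
     * Each level halves the number of active ranks, so the later
     * splits have nearly all ranks passing the same (idle) color --
     * the worst case for a sort over same-color participants. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int active = size / 2; active >= 1; active /= 2) {
            int color = (rank < active) ? 0 : 1;   /* most ranks end up idle */
            MPI_Comm level_comm;
            MPI_Comm_split(MPI_COMM_WORLD, color, rank, &level_comm);
            /* ... build the coarse-grid operator on level_comm ... */
            MPI_Comm_free(&level_comm);
        }

        MPI_Finalize();
        return 0;
    }

That gives roughly log2(N) splits per setup, matching the log(N) subcommunicators hpgmg builds by default.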
On K, if I recall correctly, the ~14 comm splits all took about the same amount of time (~15s) regardless of how many processes had the same color.

> On Nov 7, 2017, at 9:23 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Samuel,
>
> You are right, we use qsort to sort the keys, but the qsort only applies
> to participants with the same color. So the complexity of the qsort only
> reaches its worst case when most of the processes participate with the
> same color.
>
> What I think is OMPI's problem in this area is the selection of the next
> CID for the newly created communicator. We are doing the selection of the
> CID on the original communicator, and this basically accounts for a
> significant increase in the duration, as we will need to iterate longer to
> converge on a common CID.
>
> We haven't made any improvement in this area for the last few years; we
> simply transformed the code to use non-blocking communications instead of
> blocking, but this has little impact on the performance of the split
> itself.
>
> George.
>
> On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
> I'll ask my collaborators if they've submitted a ticket.
> (they have the accounts; built the code; ran the code; observed the issues)
>
> I believe the issue on MPICH was a qsort issue and not an Allreduce issue.
> When this is coupled with the fact that it looked like qsort is called in
> ompi_comm_split
> (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
> I wanted to raise the issue so that it may be investigated, to understand
> whether users can naively blunder into worst-case computational complexity
> issues.
>
> We've been running hpgmg-fv (not -fe). They were using the flux variants
> (requires a local.mk that builds operators.flux.c instead of
> operators.fv4.c) and they are a couple of commits behind. Regardless, this
> issue has persisted on K for several years. By default, it will build
> log(N) subcommunicators, where N is the problem size. Weak scaling
> experiments have shown comm_split/dup times growing consistently with the
> worst-case complexity. That being said, AMR codes might rebuild the
> subcommunicators as they regrid/adapt.
>
> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> > Samuel,
> >
> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
> > is based on Open MPI.
> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> > From a support perspective, this should be reported to the HPCI
> > helpdesk: http://www.hpci-office.jp/pages/e_support
> >
> > As far as I understand, the Fujitsu MPI currently available on K is not
> > based on the latest Open MPI.
> > I suspect most of the time is spent trying to find the new
> > communicator ID (CID) when a communicator is created (vs. figuring out
> > the new ranks).
> > IIRC, on older versions of Open MPI, that was implemented with as many
> > MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest
> > common unused CID for the newly created communicator.
> >
> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> > your program, only one MPI_Allreduce() should be involved per
> > MPI_Comm_dup().
> > But if you do the same thing in the middle of your run, after each
> > rank has a different lowest unused CID, the performance can be (much)
> > worse.
> > If I understand correctly your description of the issue, that would
> > explain the performance discrepancy between static vs. dynamic
> > communicator creation time.
> >
> > fwiw, this part has been (highly) improved in the latest releases of
> > Open MPI.
> >
> > If your benchmark is available for download, could you please post a
> > link?
> >
> > Cheers,
> >
> > Gilles
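To make the CID search Gilles describes concrete: as I understand it, the older scheme amounts to a loop like the sketch below. This is a simplified illustration only, not the actual Open MPI code; the per-rank CID bookkeeping here is a toy stand-in.

    /* Simplified sketch of agreeing on the smallest common unused CID.
     * Every rank proposes its lowest locally free CID; a MAX allreduce
     * picks a candidate; a second allreduce checks it is free everywhere.
     * If ranks start with very different lowest free CIDs (dup/split in
     * the middle of a run), many iterations may be needed. */
    #include <mpi.h>

    /* Toy stand-in for the real per-rank CID bookkeeping. */
    static unsigned long long used_cids;   /* bit i set => CID i in use */
    static int cid_is_free(int cid) {
        return cid > 63 || !((used_cids >> cid) & 1ULL);
    }

    int next_common_cid(MPI_Comm comm) {
        int candidate = 0;
        for (;;) {
            while (!cid_is_free(candidate)) candidate++;  /* lowest free CID here */
            int agreed, ok_local, ok_global;
            MPI_Allreduce(&candidate, &agreed, 1, MPI_INT, MPI_MAX, comm);
            ok_local = cid_is_free(agreed);               /* free on this rank too? */
            MPI_Allreduce(&ok_local, &ok_global, 1, MPI_INT, MPI_LAND, comm);
            if (ok_global) return agreed;                 /* everyone can use it */
            candidate = agreed + 1;                       /* keep searching upward */
        }
    }

This also shows why n dups at the start of a run cost one round each (everyone's lowest free CID agrees), while dups mid-run can take many rounds.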
> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
> >> Some of my collaborators have had issues with one of my benchmarks at
> >> high concurrency (82K MPI processes) on the K machine in Japan. I
> >> believe K uses Open MPI, and the issue has been tracked to time in
> >> MPI_Comm_dup/Comm_split increasing quadratically with process
> >> concurrency. At 82K processes, each call to dup/split is taking 15s to
> >> complete. These high times restrict comm_split/dup to static use (at
> >> the beginning of a run) rather than dynamic use within an application.
> >>
> >> I had a similar issue a few years ago on ANL/Mira/MPICH, where qsort
> >> was called to split the ranks. Although qsort/quicksort has an ideal
> >> computational complexity of O(P log P) [P is the number of MPI ranks],
> >> it can have a worst-case complexity of O(P^2)... at 82K, P/log(P) is a
> >> 5000x slowdown.
> >>
> >> Can you confirm whether qsort (or the like) is (still) used in these
> >> routines in Open MPI? It seems mergesort (worst-case complexity of
> >> O(P log P)) would be a more scalable approach. I have not observed
> >> this issue in the Cray MPICH implementation, and the Mira MPICH issue
> >> has since been resolved.
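For reference, the kind of worst-case-safe replacement I was suggesting in that last message is sketched below: a bottom-up mergesort over (key, rank) pairs with an O(P log P) worst case. Illustrative only; the pair layout is hypothetical and does not match Open MPI's internal structures.

    #include <stdlib.h>
    #include <string.h>

    typedef struct { int key; int rank; } pair_t;  /* hypothetical layout */

    /* Bottom-up mergesort over (key, rank) pairs: O(P log P) worst case,
     * unlike quicksort's O(P^2). Sorts by key, breaking ties by rank in
     * the old communicator, matching MPI_Comm_split's required ordering. */
    static void merge_sort_pairs(pair_t *a, int n) {
        pair_t *buf = malloc((size_t)n * sizeof(pair_t));
        if (buf == NULL || n < 2) { free(buf); return; }
        for (int width = 1; width < n; width *= 2) {
            for (int lo = 0; lo < n; lo += 2 * width) {
                int mid = (lo + width < n) ? lo + width : n;
                int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
                int i = lo, j = mid, k = lo;
                while (i < mid && j < hi)                  /* stable merge */
                    buf[k++] = (a[j].key < a[i].key ||
                                (a[j].key == a[i].key && a[j].rank < a[i].rank))
                               ? a[j++] : a[i++];
                while (i < mid) buf[k++] = a[i++];
                while (j < hi)  buf[k++] = a[j++];
            }
            memcpy(a, buf, (size_t)n * sizeof(pair_t));
        }
        free(buf);
    }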