Although splitting into sqrt(p) teams of sqrt(p) processes is common for 2D SpMV (or 3D 
for LU/SparseLU), splitting by coarse-grained parallelism (e.g. multiple 
concurrent solves on multiple RHS) is also possible.  Here, the teams may only be 
O(P/10) in size, but since that still scales with P you may still be sensitive to 
O(P^2) complexity.  For multigrid with agglomeration (or any algorithm where the 
number of active processes falls exponentially), you may have some splits where 
nearly all the processes have the same color.
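
As a concrete illustration of the first kind of split (a minimal sketch with 
made-up names, not code from any of the benchmarks discussed here):

#include <math.h>
#include <mpi.h>

/* Minimal sketch: split COMM_WORLD into q = sqrt(p) row teams of q ranks
 * each, as one might for a 2D SpMV decomposition (assumes p is a perfect
 * square).  Each per-color sort inside MPI_Comm_split then only sees q
 * entries rather than all p. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int q = (int)round(sqrt((double)p));
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / q, rank % q, &row_comm);

    /* ... SpMV within row_comm ... */

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}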

On K, if I recall, the ~14 comm splits all took the same amount of time (~15s) 
regardless of how many processes had the same color.



> On Nov 7, 2017, at 9:23 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> Samuel,
> 
> You are right, we use qsort to sort the keys, but the qsort only applies to the 
> participants with the same color. So the worst-case complexity of the qsort is 
> only reached when most of the processes participate with the same color.
> 
> What I think is OMPI's problem in this area is the selection of the next CID for 
> the newly created communicator. We do the selection of the CID on the 
> original communicator, and this accounts for a significant increase 
> in the duration, as we will need to iterate longer to converge to a common CID.
> 
> We haven't made any improvements in this area for the last few years; we 
> simply transformed the code to use non-blocking communications instead of 
> blocking ones, but this has little impact on the performance of the split itself.
> 
>   George.
> 
> 
> On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
> I'll ask my collaborators if they've submitted a ticket.
> (they have the accounts; built the code; ran the code; observed the issues)
> 
> I believe the issue on MPICH was a qsort issue and not an Allreduce issue.  
> When this is coupled with the fact that qsort appears to be called in 
> ompi_comm_split 
> (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
>  I wanted to raise the issue so that it may be investigated to understand 
> whether users can naively blunder into worst-case computational complexity 
> issues.
> 
> We've been running hpgmg-fv (not -fe).  They were using the flux variants 
> (which require a local.mk that builds operators.flux.c instead of operators.fv4.c), 
> and they are a couple of commits behind.  Regardless, this issue has persisted on K 
> for several years.  By default, the benchmark builds log(N) subcommunicators, where N 
> is the problem size.  Weak-scaling experiments have shown comm_split/dup times 
> growing consistently with the worst-case complexity.  That being said, AMR codes 
> might rebuild the subcommunicators as they regrid/adapt.
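> 
> To make that pattern concrete, the per-level splits look roughly like this 
> (an illustrative sketch, not the actual hpgmg code; the halving rule below is 
> made up):
> 
> #include <mpi.h>
> 
> /* Illustrative sketch: one MPI_Comm_split per multigrid level, i.e. roughly
>  * log(N) splits.  On coarse, agglomerated levels most ranks drop out
>  * (MPI_UNDEFINED) while all the survivors share a single color, which is
>  * exactly the case where the per-color sort sees its largest input. */
> void build_level_comms(MPI_Comm world, int num_levels, MPI_Comm *level_comms)
> {
>     int rank, size;
>     MPI_Comm_rank(world, &rank);
>     MPI_Comm_size(world, &size);
> 
>     for (int level = 0; level < num_levels; level++) {
>         /* Hypothetical agglomeration rule: halve the active ranks per level. */
>         int active = (rank < (size >> level));
>         MPI_Comm_split(world, active ? 0 : MPI_UNDEFINED,
>                        rank, &level_comms[level]);
>     }
> }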
> 
> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet 
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> > Samuel,
> >
> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
> > is based on Open MPI.
> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> > From a support perspective, this should be reported to the HPCI
> > helpdesk http://www.hpci-office.jp/pages/e_support
> >
> > As far as I understand, the Fujitsu MPI currently available on K is not
> > based on the latest Open MPI.
> > I suspect most of the time is spent trying to find the new
> > communicator ID (CID) when a communicator is created (vs. figuring out
> > the new ranks).
> > IIRC, on older versions of Open MPI, that was implemented with as many
> > MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest common
> > unused CID for the newly created communicator.
> >
> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> > your program, only one MPI_Allreduce() should be involved per
> > MPI_Comm_dup().
> > But if you do the same thing in the middle of your run, after each
> > rank has ended up with a different lowest unused CID, the performance
> > can be (much) worse.
> > If I understand your description of the issue correctly, that would
> > explain the performance discrepancy between static and dynamic
> > communicator creation times.
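> >
> > To illustrate, the scheme is roughly the following loop (a sketch of the
> > algorithm as described above, not the actual Open MPI code; the helpers
> > are hypothetical):
> >
> > /* Rough sketch of the older CID selection scheme: every participant
> >  * proposes its lowest locally unused CID, the group agrees on the max,
> >  * and everyone retries until the agreed CID is free on all ranks.
> >  * Each iteration costs collectives over the *original* communicator. */
> > int start = 0, agreed = -1, done = 0;
> > while (!done) {
> >     int proposal = lowest_unused_local_cid(start);  /* hypothetical helper */
> >     MPI_Allreduce(&proposal, &agreed, 1, MPI_INT, MPI_MAX, original_comm);
> >     int ok = cid_is_free_locally(agreed);           /* hypothetical helper */
> >     MPI_Allreduce(&ok, &done, 1, MPI_INT, MPI_MIN, original_comm);
> >     start = agreed + 1;   /* on failure, retry from above the rejected CID */
> > }
> > /* If ranks start from very different lowest unused CIDs, many iterations
> >  * (and thus many allreduces) are needed before they converge. */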
> >
> > fwiw, this part has been (highly) improved in the latest releases of Open 
> > MPI.
> >
> > If your benchmark is available for download, could you please post a link ?
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
> >> Some of my collaborators have had issues with one of my benchmarks at high 
> >> concurrency (82K MPI processes) on the K machine in Japan.  I believe K uses 
> >> Open MPI, and the issue has been tracked to the time spent in 
> >> MPI_Comm_dup/MPI_Comm_split increasing quadratically with process concurrency.  
> >> At 82K processes, each call to dup/split takes 15s to complete.  These high 
> >> times restrict comm_split/dup to being used statically (at the beginning of a 
> >> run) rather than dynamically in an application.
> >>
> >> I had a similar issue a few years ago on ANL/Mira/MPICH, where qsort was 
> >> called to split the ranks.  Although qsort/quicksort has an ideal 
> >> computational complexity of O(P log P)  [P is the number of MPI ranks], it 
> >> can have a worst-case complexity of O(P^2)... at 82K processes, that factor 
> >> of P/log P is a ~5000x slowdown.
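> >>
> >> For reference, the sort in question is essentially this kind of call (an 
> >> illustrative sketch of a per-color (key, rank) sort, not the actual MPICH or 
> >> Open MPI code):
> >>
> >> #include <stdlib.h>
> >>
> >> /* Illustrative sketch of the (key, rank) sort a comm_split does for the
> >>  * participants sharing a color.  With a naive pivot choice, already-sorted
> >>  * input (e.g. key == rank, as in a dup-like split) is the classic trigger
> >>  * for quicksort's O(P^2) worst case. */
> >> typedef struct { int key, rank; } split_entry;
> >>
> >> static int compare_entries(const void *a, const void *b)
> >> {
> >>     const split_entry *x = a, *y = b;
> >>     if (x->key != y->key) return (x->key < y->key) ? -1 : 1;
> >>     return (x->rank < y->rank) ? -1 : (x->rank > y->rank);
> >> }
> >>
> >> void sort_participants(split_entry *entries, size_t n)
> >> {
> >>     qsort(entries, n, sizeof(split_entry), compare_entries);
> >> }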
> >>
> >> Can you confirm whether qsort (or the like) is (still) used in these 
> >> routines in Open MPI?  It seems mergesort (worst-case complexity of 
> >> O(P log P)) would be a more scalable approach.  I have not observed this 
> >> issue with the Cray MPICH implementation, and the Mira MPICH issue has since 
> >> been resolved.
> >>
> >>

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
