Samuel,

You are right, we use qsort to sort the keys, but the qsort applies only to
participants with the same color. So the worst-case complexity of the qsort
can only be reached when most of the processes participate with the same
color.
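
For illustration, here is a minimal sketch of that worst case (a made-up toy
program, not taken from any benchmark): every rank passes the same color, so
the key sort inside MPI_Comm_split runs over all P participants instead of
over many small per-color groups.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Worst case for the key sort: every rank uses the same color,
         * so MPI_Comm_split sorts all P (key, rank) pairs in a single
         * group instead of sorting many small per-color groups. */
        MPI_Comm newcomm;
        MPI_Comm_split(MPI_COMM_WORLD, /* color */ 0, /* key */ rank,
                       &newcomm);

        MPI_Comm_free(&newcomm);
        MPI_Finalize();
        return 0;
    }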

What I think is Open MPI's problem in this area is the selection of the next
CID for the newly created communicator. We do the CID selection on the
original communicator, and this accounts for a significant share of the
duration, as it needs to iterate longer to converge on a common CID.
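
Roughly, the scheme works as in the sketch below (a much-simplified
illustration matching Gilles's description downthread, not the actual
Open MPI code; select_cid and cid_is_free are made-up names): each rank
proposes its lowest locally unused CID, and rounds of
MPI_Allreduce(MPI_MAX) on the original communicator repeat until all ranks
agree on a CID that is free everywhere.

    #include <mpi.h>

    /* Simplified sketch, not the real implementation: agree on the
     * smallest CID that is unused on every rank of "comm". */
    static int select_cid(MPI_Comm comm, int (*cid_is_free)(int))
    {
        int candidate = 0;
        for (;;) {
            while (!cid_is_free(candidate))   /* lowest locally unused CID */
                candidate++;

            int winner;
            MPI_Allreduce(&candidate, &winner, 1, MPI_INT, MPI_MAX, comm);

            int ok = cid_is_free(winner), all_ok;
            MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, comm);
            if (all_ok)
                return winner;    /* every rank can use this CID */
            candidate = winner;   /* otherwise retry from the max seen */
        }
    }

The more the ranks' lowest unused CIDs have diverged over the life of the
run, the more rounds (and Allreduces) this loop needs, which is why a
dup/split done mid-run can be much slower than one done at startup.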

We haven't made any improvements in this area for the last few years; we
simply transformed the code to use non-blocking communications instead of
blocking ones, but this has little impact on the performance of the split
itself.

  George.


On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams <swwilli...@lbl.gov> wrote:

> I'll ask my collaborators if they've submitted a ticket.
> (they have the accounts; built the code; ran the code; observed the issues)
>
> I believe the issue on MPICH was a qsort issue and not an Allreduce issue.
> When this is coupled with the fact that it looks like qsort is called in
> ompi_comm_split (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
> I wanted to raise the issue so that it may be investigated to understand
> whether users can naively blunder into worst-case computational complexity
> issues.
>
> We've been running hpgmg-fv (not -fe).  They were using the flux variants
> (requires local.mk to build operators.flux.c instead of operators.fv4.c),
> and they are a couple of commits behind.  Regardless, this issue has
> persisted on K for several years.  By default, the benchmark builds log(N)
> subcommunicators, where N is the problem size.  Weak-scaling experiments
> have shown comm_split/dup times growing consistently with the worst-case
> complexity.  That being said, AMR codes might rebuild the subcommunicators
> as they regrid/adapt.
>
> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> >
> > Samuel,
> >
> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
> > is based on Open MPI.
> > /* FWIW, an alternative is RIKEN MPI, and it is MPICH-based */
> > From a support perspective, this should be reported to the HPCI
> > helpdesk: http://www.hpci-office.jp/pages/e_support
> >
> > As far as I understand, the Fujitsu MPI currently available on K is not
> > based on the latest Open MPI.
> > I suspect most of the time is spent trying to find the new
> > communicator ID (CID) when a communicator is created (vs. figuring out
> > the new ranks).
> > IIRC, on older versions of Open MPI, that was implemented with as many
> > MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest common
> > unused CID for the newly created communicator.
> >
> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> > your program, only one MPI_Allreduce() should be involved per
> > MPI_Comm_dup().
> > But if you do the same thing in the middle of your run, after each
> > rank has ended up with a different lowest unused CID, the performance
> > can be (much) worse.
> > If I understand your description of the issue correctly, that would
> > explain the performance discrepancy between static and dynamic
> > communicator creation times.
> >
> > FWIW, this part has been (greatly) improved in the latest releases of
> > Open MPI.
> >
> > If your benchmark is available for download, could you please post a
> > link?
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
> >> Some of my collaborators have had issues with one of my benchmarks at
> >> high concurrency (82K MPI processes) on the K machine in Japan.  I
> >> believe K uses Open MPI, and the issue has been tracked to the time in
> >> MPI_Comm_dup/MPI_Comm_split increasing quadratically with process
> >> concurrency.  At 82K processes, each call to dup/split takes 15s to
> >> complete.  These high times restrict comm_split/dup to static use (at
> >> the beginning of a run) rather than dynamic use within an application.
> >>
> >> I had a similar issue a few years ago on ANL/Mira/MPICH, where qsort
> >> was called to split the ranks.  Although qsort/quicksort has an ideal
> >> computational complexity of O(P log P) [P is the number of MPI ranks],
> >> it can have a worst-case complexity of O(P^2)... and the ratio between
> >> the two, P/log(P), is roughly a 5000x slowdown at 82K processes.
> >>
> >> Can you confirm whether qsort (or the like) is (still) used in these
> >> routines in Open MPI?  It seems mergesort (worst-case complexity of
> >> O(P log P)) would be a more scalable approach.  I have not observed this
> >> issue with the Cray MPICH implementation, and the Mira MPICH issue has
> >> since been resolved.
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
