My guess would be that both aspects (sorting + CID allocation) could be a problem.

A long time back there was an effort to convert the sequence of allgather + qsort into a distributed sort (based on a paper by Moody et al. that demonstrated the benefits of this approach).  Unfortunately, we didn't get to implement this part due to timing constraints.

Also, we used to have a block allocation algorithm for CIDs that saved quite a bit in communication costs, but we had to abandon it because the implementation wasted CIDs by not properly reusing them. It is fairly clear how to implement it correctly; it just hasn't been done for lack of resources.
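
Roughly, the block idea looks like the sketch below (an illustrative
toy written for this mail, not the abandoned implementation): a single
MPI_Allreduce reserves a block of CIDs, and the allocations that follow
are purely local until the block runs out.  The block size and the
counters are made up for the example.

#include <mpi.h>
#include <stdio.h>

#define CID_BLOCK 16   /* number of CIDs reserved per agreement round */

/* Illustrative per-rank state; a real implementation would track this
 * per parent communicator. */
static int block_next = 0;   /* next CID to hand out from the block */
static int block_end  = 0;   /* first CID past the current block    */

static int alloc_cid(MPI_Comm comm)
{
    if (block_next >= block_end) {
        /* Block exhausted (or first call): agree on a fresh block with
         * a single MAX reduction over everyone's proposal. */
        int proposed = block_end;
        int start;
        MPI_Allreduce(&proposed, &start, 1, MPI_INT, MPI_MAX, comm);
        block_next = start;
        block_end  = start + CID_BLOCK;
    }
    /* Purely local for the next CID_BLOCK - 1 allocations.  Freed CIDs
     * are never returned to the pool here, which is exactly the kind of
     * waste that made us drop the old implementation. */
    return block_next++;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int i = 0; i < 40; i++) {
        int cid = alloc_cid(MPI_COMM_WORLD);
        if (i % 10 == 0) printf("communicator %d gets CID %d\n", i, cid);
    }
    MPI_Finalize();
    return 0;
}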

Maybe we should revisit those items at some point.
Thanks
Edgar

On 11/7/2017 10:23 AM, George Bosilca wrote:
Samuel,

You are right, we use qsort to sort the keys, but the qsort only applies to participants with the same color. So the worst-case complexity of the qsort is only reached when most of the processes participate with the same color.
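
For example, a split where every rank passes the same color (a made-up
worst case, not taken from any particular benchmark) puts the whole
communicator into a single group, so the key sort runs over all P
participants:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All ranks use color 0, so the split degenerates to a reordering
     * of MPI_COMM_WORLD and the per-color key sort sees all P entries. */
    MPI_Comm newcomm;
    MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &newcomm);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}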

What I think is OMPI's problem in this area is the selection of the next CID for the newly created communicator. We do the selection of the CID on the original communicator, and this accounts for a significant increase in the duration, as we will need to iterate longer to converge on a common CID.
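
The agreement itself is essentially an iterative max-reduction.  A
simplified sketch looks like the following (illustrative only, not the
actual Open MPI code, which gets away with fewer reductions in the
common case):

#include <mpi.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_CID 4096

/* Per-rank view of which CIDs are already taken; in the library this
 * comes from the communicator table.  Purely illustrative here. */
static bool cid_in_use[MAX_CID];

static int next_local_free_cid(int floor)
{
    int cid = floor;
    while (cid < MAX_CID && cid_in_use[cid]) {
        cid++;
    }
    return cid;
}

/* Every rank proposes its lowest locally free CID at or above the
 * current floor, the group takes the MAX, and the loop repeats until
 * the max stops moving, i.e. the value is free on every rank.  The
 * further apart the ranks' local "lowest free CID" values are, the
 * more MPI_Allreduce rounds it takes to converge. */
static int agree_on_cid(MPI_Comm comm)
{
    int floor = 0, prev = -1;

    for (;;) {
        int proposed = next_local_free_cid(floor);
        int winner;
        MPI_Allreduce(&proposed, &winner, 1, MPI_INT, MPI_MAX, comm);

        if (winner == prev) {      /* nobody raised: free everywhere */
            cid_in_use[winner] = true;
            return winner;
        }
        prev = floor = winner;     /* keep searching upward from the max */
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int cid = agree_on_cid(MPI_COMM_WORLD);
    printf("agreed on CID %d\n", cid);
    MPI_Finalize();
    return 0;
}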

We haven't made any improvements in this area in the last few years; we simply transformed the code to use non-blocking communications instead of blocking ones, but this has little impact on the performance of the split itself.

  George.


On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams <swwilli...@lbl.gov> wrote:

    I'll ask my collaborators if they've submitted a ticket.
    (they have the accounts; built the code; ran the code; observed
    the issues)

    I believe the issue on MPICH was a qsort issue and not an Allreduce
    issue.  When this is coupled with the fact that it looked like
    qsort is called in ompi_comm_split
    (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
    I wanted to raise the issue so that it may be investigated to
    understand whether users can naively blunder into worst-case
    computational complexity issues.

    We've been running hpgmg-fv (not -fe).  They were using the flux
    variants (which require a local.mk that builds operators.flux.c
    instead of operators.fv4.c) and they are a couple of commits
    behind.  Regardless, this issue has persisted on K for several
    years.  By default, the code builds log(N) subcommunicators,
    where N is the problem size.  Weak scaling experiments have shown
    comm_split/dup times growing consistently with the worst-case
    complexity.  That being said, AMR codes might rebuild the
    subcommunicators as they regrid/adapt.
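
    For reference, the subcommunicator pattern looks roughly like the
    sketch below (illustrative only, not the actual hpgmg code): one
    communicator per level, halving the set of active ranks each time,
    so about log2(P) calls to comm_split.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        enum { MAX_LEVELS = 64 };
        MPI_Comm level_comm[MAX_LEVELS];
        int nlevels = 0;

        /* One split per level; run inside an AMR regrid loop instead
         * of once at startup, and every level pays this cost again. */
        for (int active = size; active > 1 && nlevels < MAX_LEVELS; active /= 2) {
            /* Ranks below the cutoff stay in the new communicator; the
             * rest pass MPI_UNDEFINED and get MPI_COMM_NULL back. */
            int color = (rank < active) ? 0 : MPI_UNDEFINED;
            MPI_Comm_split(MPI_COMM_WORLD, color, rank, &level_comm[nlevels]);
            nlevels++;
        }

        for (int l = 0; l < nlevels; l++) {
            if (level_comm[l] != MPI_COMM_NULL) {
                MPI_Comm_free(&level_comm[l]);
            }
        }

        MPI_Finalize();
        return 0;
    }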

    > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet
    > <gilles.gouaillar...@gmail.com> wrote:
    >
    > Samuel,
    >
    > The default MPI library on the K computer is Fujitsu MPI, and
    > yes, it is based on Open MPI.
    > /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
    > From a support perspective, this should be reported to the HPCI
    > helpdesk: http://www.hpci-office.jp/pages/e_support
    >
    > As far as I understand, the Fujitsu MPI currently available on K
    > is not based on the latest Open MPI.
    > I suspect most of the time is spent trying to find the new
    > communicator ID (CID) when a communicator is created (vs.
    > figuring out the new ranks).
    > IIRC, on older versions of Open MPI, that was implemented with
    > as many MPI_Allreduce(MPI_MAX) calls as needed to figure out the
    > smallest common unused CID for the newly created communicator.
    >
    > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
    > your program, only one MPI_Allreduce() should be involved per
    > MPI_Comm_dup().
    > But if you do the same thing in the middle of your run, after each
    > rank has ended up with a different lowest unused CID, the
    > performance can be (much) worse.
    > If I understand correctly your description of the issue, that would
    > explain the performance discrepancy between static vs. dynamic
    > communicator creation time.
    >
    > FWIW, this part has been (highly) improved in the latest
    > releases of Open MPI.
    >
    > If your benchmark is available for download, could you please
    > post a link?
    >
    >
    > Cheers,
    >
    > Gilles
    >
    > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams
    > <swwilli...@lbl.gov> wrote:
    >> Some of my collaborators have had issues with one of my
    >> benchmarks at high concurrency (82K MPI procs) on the K machine in
    >> Japan.  I believe K uses Open MPI, and the issue has been tracked
    >> to time in MPI_Comm_dup/Comm_split increasing quadratically with
    >> process concurrency.  At 82K processes, each call to dup/split is
    >> taking 15s to complete.  These high times restrict comm_split/dup
    >> to being used statically (at the beginning) and not dynamically in
    >> an application.
    >>
    >> I had a similar issue a few years ago on ANL/Mira/MPICH, where
    >> they called qsort to split the ranks.  Although qsort/quicksort
    >> has an ideal computational complexity of O(P log P) [P is the
    >> number of MPI ranks], it can have a worst-case complexity of
    >> O(P^2)... at 82K processes, P/log2(P) is roughly 82,000/16.3,
    >> i.e. about a 5000x slowdown.
    >>
    >> Can you confirm whether qsort (or the like) is (still) used in
    >> these routines in Open MPI?  It seems mergesort (worst-case
    >> complexity of O(P log P)) would be a more scalable approach.  I
    >> have not observed this issue on the Cray MPICH implementation,
    >> and the Mira MPICH issue has since been resolved.
    >>
    >>
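    For reference, the mergesort alternative I mention above would look
    roughly like the sketch below (the struct layout and function names
    are made up for illustration, not taken from Open MPI):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        int color;
        int key;
        int rank;
    } split_entry_t;

    /* Order by (color, key, original rank); ties on color and key fall
     * back to the original rank. */
    static int entry_less(const split_entry_t *a, const split_entry_t *b)
    {
        if (a->color != b->color) return a->color < b->color;
        if (a->key   != b->key)   return a->key   < b->key;
        return a->rank < b->rank;
    }

    /* Bottom-up merge sort: O(P log P) in the worst case and stable,
     * unlike quicksort, whose worst case can degrade to O(P^2). */
    static void merge_sort_entries(split_entry_t *a, int n)
    {
        split_entry_t *tmp = malloc((size_t)n * sizeof(*tmp));
        if (tmp == NULL || n < 2) { free(tmp); return; }

        for (int width = 1; width < n; width *= 2) {
            for (int lo = 0; lo < n; lo += 2 * width) {
                int mid = lo + width     < n ? lo + width     : n;
                int hi  = lo + 2 * width < n ? lo + 2 * width : n;
                int i = lo, j = mid, k = lo;
                /* Take from the right run only when strictly smaller,
                 * which keeps equal entries in order (stability). */
                while (i < mid && j < hi)
                    tmp[k++] = entry_less(&a[j], &a[i]) ? a[j++] : a[i++];
                while (i < mid) tmp[k++] = a[i++];
                while (j < hi)  tmp[k++] = a[j++];
            }
            memcpy(a, tmp, (size_t)n * sizeof(*a));
        }
        free(tmp);
    }

    int main(void)
    {
        split_entry_t e[] = { {1, 5, 0}, {0, 2, 1}, {1, 5, 2}, {0, 1, 3} };
        int n = (int)(sizeof(e) / sizeof(e[0]));
        merge_sort_entries(e, n);
        for (int i = 0; i < n; i++)
            printf("color=%d key=%d rank=%d\n", e[i].color, e[i].key, e[i].rank);
        return 0;
    }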

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel