I should have been clearer. This changes how the topo components are expected to work. With this change they now MUST provide the c_topo to the communicator before coll_select. The change is good but may be unexpected for any vendor topo components out there (which I doubt there are any).
-Nathan On Wed, Oct 01, 2014 at 12:52:00PM +0000, Jeff Squyres (jsquyres) wrote: > On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > > > oops, i missed graph and dist graph :-( > > > > the attached patch fixes this > > /* it was straightforward for graph and a bit more challenging for dist > > graph */ > > > > i am fine for Nathan's patch for v1.8 > > /* v1.8 and trunk have already a different way to handle a memory leak */ > > /* the attached patch is vs v1.8 but it is for review only, i will port > > it to the trunk */ > > > > imho, there is no ABI break with this patch. > > out of curiosity, what is the ABI break that is/will be introduced with > > the "real" fix ? > > I haven't looked at the code at all, but Nathan said it changed the topo > framework interface. That's the ABI I was referring to -- not the MPI ABI. > Sorry for the confusions. > > > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm > > test suite in order to test > > graph and dist_graph > > > > Cheers, > > > > Gilles > > > > > > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: > >> On the call today, we decided that Nathan's v1.8 patch is sufficient for > >> the v1.8 series, and the "real" fix will be applied to the trunk and > >> carried forward to the v1.9 series (since it introduces an ABI break for > >> the topo framework, which we try not to do in the middle of a release > >> series). > >> > >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm <hje...@lanl.gov> wrote: > >> > >>> Not quite right. There still is no topology information at collective > >>> selection time for either graph or dist graph. > >>> > >>> -Nathan > >>> > >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: > >>>> Nathan, > >>>> > >>>> here is a revision of the previously attached patch, and that supports > >>>> graph and dist graph. > >>>> > >>>> Cheers, > >>>> > >>>> Gilles > >>>> > >>>> On 2014/09/30 0:05, Nathan Hjelm wrote: > >>>> > >>>> An equivalent change would need to be made for graph and dist graph as > >>>> well. That will take a little more work. Also, I was avoiding changing > >>>> anything in topo for 1.8. > >>>> > >>>> -Nathan > >>>> > >>>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: > >>>> > >>>> Nathan, > >>>> > >>>> why not just make the topology information available at that point as > >>>> you > >>>> described it ? > >>>> > >>>> the attached patch does this, could you please review it ? > >>>> > >>>> Cheers, > >>>> > >>>> Gilles > >>>> > >>>> On 2014/09/26 2:50, Nathan Hjelm wrote: > >>>> > >>>> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > >>>> > >>>> I finally managed to track down some issues in mpi4py's test suite > >>>> using Open MPI 1.8+. The code below should be enough to reproduce the > >>>> problem. Run it under valgrind to make sense of my following > >>>> diagnostics. > >>>> > >>>> In this code I'm creating a 2D, periodic Cartesian topology out of > >>>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > >>>> links to itself. So we have size=1 but indegree=outdegree=4. However, > >>>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > >>>> being allocated to manage communication: > >>>> > >>>> if (OMPI_COMM_IS_INTER(comm)) { > >>>> size = ompi_comm_remote_size(comm); > >>>> } else { > >>>> size = ompi_comm_size(comm); > >>>> } > >>>> basic_module->mccb_num_reqs = size * 2; > >>>> basic_module->mccb_reqs = (ompi_request_t**) > >>>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > >>>> > >>>> I guess you have to also special-case for topologies and allocate > >>>> indegree+outdegree requests (not sure about this number, just > >>>> guessing). > >>>> > >>>> > >>>> I wish this was possible but the topology information is not available > >>>> at that point. We may be able to change that but I don't see the work > >>>> completing anytime soon. I committed an alternative fix as r32796 and > >>>> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > >>>> produces a SEGV. Let me know if you run into any more issues. > >>>> > >>>> > >>>> -Nathan > >>>> > >>>> _______________________________________________ > >>>> devel mailing list > >>>> de...@open-mpi.org > >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>>> Link to this post: > >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > >>>> > >>>> Index: ompi/mca/topo/base/topo_base_cart_create.c > >>>> =================================================================== > >>>> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) > >>>> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) > >>>> @@ -163,10 +163,18 @@ > >>>> return MPI_ERR_INTERN; > >>>> } > >>>> > >>>> + assert(NULL == new_comm->c_topo); > >>>> + assert(!(new_comm->c_flags & OMPI_COMM_CART)); > >>>> + new_comm->c_topo = topo; > >>>> + new_comm->c_topo->mtc.cart = cart; > >>>> + new_comm->c_topo->reorder = reorder; > >>>> + new_comm->c_flags |= OMPI_COMM_CART; > >>>> ret = ompi_comm_enable(old_comm, new_comm, > >>>> new_rank, num_procs, topo_procs); > >>>> if (OMPI_SUCCESS != ret) { > >>>> /* something wrong happened during setting the communicator */ > >>>> + new_comm->c_topo = NULL; > >>>> + new_comm->c_flags &= ~OMPI_COMM_CART; > >>>> ompi_comm_free (&new_comm); > >>>> free(topo_procs); > >>>> if(NULL != cart->periods) free(cart->periods); > >>>> @@ -176,10 +184,6 @@ > >>>> return ret; > >>>> } > >>>> > >>>> - new_comm->c_topo = topo; > >>>> - new_comm->c_topo->mtc.cart = cart; > >>>> - new_comm->c_topo->reorder = reorder; > >>>> - new_comm->c_flags |= OMPI_COMM_CART; > >>>> *comm_topo = new_comm; > >>>> > >>>> if( MPI_UNDEFINED == new_rank ) { > >>>> Index: ompi/mca/coll/basic/coll_basic_module.c > >>>> =================================================================== > >>>> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) > >>>> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) > >>>> @@ -13,6 +13,8 @@ > >>>> * Copyright (c) 2012 Sandia National Laboratories. All rights > >>>> reserved. > >>>> * Copyright (c) 2013 Los Alamos National Security, LLC. All rights > >>>> * reserved. > >>>> + * Copyright (c) 2014 Research Organization for Information Science > >>>> + * and Technology (RIST). All rights reserved. > >>>> * $COPYRIGHT$ > >>>> * > >>>> * Additional copyrights may follow > >>>> @@ -28,6 +30,7 @@ > >>>> #include "mpi.h" > >>>> #include "ompi/mca/coll/coll.h" > >>>> #include "ompi/mca/coll/base/base.h" > >>>> +#include "ompi/mca/topo/topo.h" > >>>> #include "coll_basic.h" > >>>> > >>>> > >>>> @@ -70,6 +73,15 @@ > >>>> } else { > >>>> size = ompi_comm_size(comm); > >>>> } > >>>> + if (comm->c_flags & OMPI_COMM_CART) { > >>>> + int cart_size; > >>>> + assert (NULL != comm->c_topo); > >>>> + comm->c_topo->topo.cart.cartdim_get(comm, &cart_size); > >>>> + cart_size *= 2; > >>>> + if (cart_size > size) { > >>>> + size = cart_size; > >>>> + } > >>>> + } > >>>> basic_module->mccb_num_reqs = size * 2; > >>>> basic_module->mccb_reqs = (ompi_request_t**) > >>>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > >>>> > >>>> _______________________________________________ > >>>> devel mailing list > >>>> de...@open-mpi.org > >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>>> Link to this post: > >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15929.php > >>>> > >>>> _______________________________________________ > >>>> devel mailing list > >>>> de...@open-mpi.org > >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>>> Link to this post: > >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15930.php > >>>> Index: ompi/mca/topo/base/topo_base_cart_create.c > >>>> =================================================================== > >>>> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32815) > >>>> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) > >>>> @@ -163,10 +163,18 @@ > >>>> return MPI_ERR_INTERN; > >>>> } > >>>> > >>>> + assert(NULL == new_comm->c_topo); > >>>> + assert(!(new_comm->c_flags & OMPI_COMM_CART)); > >>>> + new_comm->c_topo = topo; > >>>> + new_comm->c_topo->mtc.cart = cart; > >>>> + new_comm->c_topo->reorder = reorder; > >>>> + new_comm->c_flags |= OMPI_COMM_CART; > >>>> ret = ompi_comm_enable(old_comm, new_comm, > >>>> new_rank, num_procs, topo_procs); > >>>> if (OMPI_SUCCESS != ret) { > >>>> /* something wrong happened during setting the communicator */ > >>>> + new_comm->c_topo = NULL; > >>>> + new_comm->c_flags &= ~OMPI_COMM_CART; > >>>> ompi_comm_free (&new_comm); > >>>> free(topo_procs); > >>>> if(NULL != cart->periods) free(cart->periods); > >>>> @@ -176,10 +184,6 @@ > >>>> return ret; > >>>> } > >>>> > >>>> - new_comm->c_topo = topo; > >>>> - new_comm->c_topo->mtc.cart = cart; > >>>> - new_comm->c_topo->reorder = reorder; > >>>> - new_comm->c_flags |= OMPI_COMM_CART; > >>>> *comm_topo = new_comm; > >>>> > >>>> if( MPI_UNDEFINED == new_rank ) { > >>>> Index: ompi/mca/coll/basic/coll_basic_module.c > >>>> =================================================================== > >>>> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32815) > >>>> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) > >>>> @@ -13,6 +13,8 @@ > >>>> * Copyright (c) 2012 Sandia National Laboratories. All rights > >>>> reserved. > >>>> * Copyright (c) 2013 Los Alamos National Security, LLC. All rights > >>>> * reserved. > >>>> + * Copyright (c) 2014 Research Organization for Information Science > >>>> + * and Technology (RIST). All rights reserved. > >>>> * $COPYRIGHT$ > >>>> * > >>>> * Additional copyrights may follow > >>>> @@ -28,6 +30,8 @@ > >>>> #include "mpi.h" > >>>> #include "ompi/mca/coll/coll.h" > >>>> #include "ompi/mca/coll/base/base.h" > >>>> +#include "ompi/mca/topo/topo.h" > >>>> +#include "ompi/mca/topo/base/base.h" > >>>> #include "coll_basic.h" > >>>> > >>>> > >>>> @@ -70,7 +74,36 @@ > >>>> } else { > >>>> size = ompi_comm_size(comm); > >>>> } > >>>> - basic_module->mccb_num_reqs = size * 2; > >>>> + size *= 2; > >>>> + if (OMPI_COMM_IS_CART(comm)) { > >>>> + int cart_size; > >>>> + mca_topo_base_comm_cart_2_1_0_t *cart; > >>>> + assert (NULL != comm->c_topo); > >>>> + cart = comm->c_topo->mtc.cart; > >>>> + cart_size = cart->ndims * 4; > >>>> + if (cart_size > size) { > >>>> + size = cart_size; > >>>> + } > >>>> + } else if (OMPI_COMM_IS_GRAPH(comm)) { > >>>> + int rank, degree; > >>>> + assert (NULL != comm->c_topo); > >>>> + rank = ompi_comm_rank (comm); > >>>> + mca_topo_base_graph_neighbors_count (comm, rank, °ree); > >>>> + degree *= 2; > >>>> + if (degree > size) { > >>>> + size = degree; > >>>> + } > >>>> + } else if (OMPI_COMM_IS_DIST_GRAPH(comm)) { > >>>> + int dist_graph_size; > >>>> + mca_topo_base_comm_dist_graph_2_1_0_t *dist_graph; > >>>> + assert (NULL != comm->c_topo); > >>>> + dist_graph = comm->c_topo->mtc.dist_graph; > >>>> + dist_graph_size = dist_graph->indegree + dist_graph->outdegree; > >>>> + if (dist_graph_size > size) { > >>>> + size = dist_graph_size; > >>>> + } > >>>> + } > >>>> + basic_module->mccb_num_reqs = size; > >>>> basic_module->mccb_reqs = (ompi_request_t**) > >>>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > >>>> > >>>> _______________________________________________ > >>>> devel mailing list > >>>> de...@open-mpi.org > >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>>> Link to this post: > >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15941.php > >>> _______________________________________________ > >>> devel mailing list > >>> de...@open-mpi.org > >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>> Link to this post: > >>> http://www.open-mpi.org/community/lists/devel/2014/09/15945.php > >> > > > > <mmcb.v3.patch>_______________________________________________ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/10/15957.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/10/15959.php
pgpbd0gzWPE3S.pgp
Description: PGP signature