On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> oops, i missed graph and dist graph :-( > > the attached patch fixes this > /* it was straightforward for graph and a bit more challenging for dist > graph */ > > i am fine for Nathan's patch for v1.8 > /* v1.8 and trunk have already a different way to handle a memory leak */ > /* the attached patch is vs v1.8 but it is for review only, i will port > it to the trunk */ > > imho, there is no ABI break with this patch. > out of curiosity, what is the ABI break that is/will be introduced with > the "real" fix ? I haven't looked at the code at all, but Nathan said it changed the topo framework interface. That's the ABI I was referring to -- not the MPI ABI. Sorry for the confusions. > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm > test suite in order to test > graph and dist_graph > > Cheers, > > Gilles > > > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: >> On the call today, we decided that Nathan's v1.8 patch is sufficient for the >> v1.8 series, and the "real" fix will be applied to the trunk and carried >> forward to the v1.9 series (since it introduces an ABI break for the topo >> framework, which we try not to do in the middle of a release series). >> >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm <hje...@lanl.gov> wrote: >> >>> Not quite right. There still is no topology information at collective >>> selection time for either graph or dist graph. >>> >>> -Nathan >>> >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: >>>> Nathan, >>>> >>>> here is a revision of the previously attached patch, and that supports >>>> graph and dist graph. >>>> >>>> Cheers, >>>> >>>> Gilles >>>> >>>> On 2014/09/30 0:05, Nathan Hjelm wrote: >>>> >>>> An equivalent change would need to be made for graph and dist graph as >>>> well. That will take a little more work. Also, I was avoiding changing >>>> anything in topo for 1.8. >>>> >>>> -Nathan >>>> >>>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >>>> >>>> Nathan, >>>> >>>> why not just make the topology information available at that point as you >>>> described it ? >>>> >>>> the attached patch does this, could you please review it ? >>>> >>>> Cheers, >>>> >>>> Gilles >>>> >>>> On 2014/09/26 2:50, Nathan Hjelm wrote: >>>> >>>> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >>>> >>>> I finally managed to track down some issues in mpi4py's test suite >>>> using Open MPI 1.8+. The code below should be enough to reproduce the >>>> problem. Run it under valgrind to make sense of my following >>>> diagnostics. >>>> >>>> In this code I'm creating a 2D, periodic Cartesian topology out of >>>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >>>> links to itself. So we have size=1 but indegree=outdegree=4. However, >>>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >>>> being allocated to manage communication: >>>> >>>> if (OMPI_COMM_IS_INTER(comm)) { >>>> size = ompi_comm_remote_size(comm); >>>> } else { >>>> size = ompi_comm_size(comm); >>>> } >>>> basic_module->mccb_num_reqs = size * 2; >>>> basic_module->mccb_reqs = (ompi_request_t**) >>>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >>>> >>>> I guess you have to also special-case for topologies and allocate >>>> indegree+outdegree requests (not sure about this number, just >>>> guessing). >>>> >>>> >>>> I wish this was possible but the topology information is not available >>>> at that point. We may be able to change that but I don't see the work >>>> completing anytime soon. I committed an alternative fix as r32796 and >>>> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer >>>> produces a SEGV. Let me know if you run into any more issues. >>>> >>>> >>>> -Nathan >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php >>>> >>>> Index: ompi/mca/topo/base/topo_base_cart_create.c >>>> =================================================================== >>>> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) >>>> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) >>>> @@ -163,10 +163,18 @@ >>>> return MPI_ERR_INTERN; >>>> } >>>> >>>> + assert(NULL == new_comm->c_topo); >>>> + assert(!(new_comm->c_flags & OMPI_COMM_CART)); >>>> + new_comm->c_topo = topo; >>>> + new_comm->c_topo->mtc.cart = cart; >>>> + new_comm->c_topo->reorder = reorder; >>>> + new_comm->c_flags |= OMPI_COMM_CART; >>>> ret = ompi_comm_enable(old_comm, new_comm, >>>> new_rank, num_procs, topo_procs); >>>> if (OMPI_SUCCESS != ret) { >>>> /* something wrong happened during setting the communicator */ >>>> + new_comm->c_topo = NULL; >>>> + new_comm->c_flags &= ~OMPI_COMM_CART; >>>> ompi_comm_free (&new_comm); >>>> free(topo_procs); >>>> if(NULL != cart->periods) free(cart->periods); >>>> @@ -176,10 +184,6 @@ >>>> return ret; >>>> } >>>> >>>> - new_comm->c_topo = topo; >>>> - new_comm->c_topo->mtc.cart = cart; >>>> - new_comm->c_topo->reorder = reorder; >>>> - new_comm->c_flags |= OMPI_COMM_CART; >>>> *comm_topo = new_comm; >>>> >>>> if( MPI_UNDEFINED == new_rank ) { >>>> Index: ompi/mca/coll/basic/coll_basic_module.c >>>> =================================================================== >>>> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) >>>> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) >>>> @@ -13,6 +13,8 @@ >>>> * Copyright (c) 2012 Sandia National Laboratories. All rights >>>> reserved. >>>> * Copyright (c) 2013 Los Alamos National Security, LLC. All rights >>>> * reserved. >>>> + * Copyright (c) 2014 Research Organization for Information Science >>>> + * and Technology (RIST). All rights reserved. >>>> * $COPYRIGHT$ >>>> * >>>> * Additional copyrights may follow >>>> @@ -28,6 +30,7 @@ >>>> #include "mpi.h" >>>> #include "ompi/mca/coll/coll.h" >>>> #include "ompi/mca/coll/base/base.h" >>>> +#include "ompi/mca/topo/topo.h" >>>> #include "coll_basic.h" >>>> >>>> >>>> @@ -70,6 +73,15 @@ >>>> } else { >>>> size = ompi_comm_size(comm); >>>> } >>>> + if (comm->c_flags & OMPI_COMM_CART) { >>>> + int cart_size; >>>> + assert (NULL != comm->c_topo); >>>> + comm->c_topo->topo.cart.cartdim_get(comm, &cart_size); >>>> + cart_size *= 2; >>>> + if (cart_size > size) { >>>> + size = cart_size; >>>> + } >>>> + } >>>> basic_module->mccb_num_reqs = size * 2; >>>> basic_module->mccb_reqs = (ompi_request_t**) >>>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15929.php >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15930.php >>>> Index: ompi/mca/topo/base/topo_base_cart_create.c >>>> =================================================================== >>>> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32815) >>>> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) >>>> @@ -163,10 +163,18 @@ >>>> return MPI_ERR_INTERN; >>>> } >>>> >>>> + assert(NULL == new_comm->c_topo); >>>> + assert(!(new_comm->c_flags & OMPI_COMM_CART)); >>>> + new_comm->c_topo = topo; >>>> + new_comm->c_topo->mtc.cart = cart; >>>> + new_comm->c_topo->reorder = reorder; >>>> + new_comm->c_flags |= OMPI_COMM_CART; >>>> ret = ompi_comm_enable(old_comm, new_comm, >>>> new_rank, num_procs, topo_procs); >>>> if (OMPI_SUCCESS != ret) { >>>> /* something wrong happened during setting the communicator */ >>>> + new_comm->c_topo = NULL; >>>> + new_comm->c_flags &= ~OMPI_COMM_CART; >>>> ompi_comm_free (&new_comm); >>>> free(topo_procs); >>>> if(NULL != cart->periods) free(cart->periods); >>>> @@ -176,10 +184,6 @@ >>>> return ret; >>>> } >>>> >>>> - new_comm->c_topo = topo; >>>> - new_comm->c_topo->mtc.cart = cart; >>>> - new_comm->c_topo->reorder = reorder; >>>> - new_comm->c_flags |= OMPI_COMM_CART; >>>> *comm_topo = new_comm; >>>> >>>> if( MPI_UNDEFINED == new_rank ) { >>>> Index: ompi/mca/coll/basic/coll_basic_module.c >>>> =================================================================== >>>> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32815) >>>> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) >>>> @@ -13,6 +13,8 @@ >>>> * Copyright (c) 2012 Sandia National Laboratories. All rights >>>> reserved. >>>> * Copyright (c) 2013 Los Alamos National Security, LLC. All rights >>>> * reserved. >>>> + * Copyright (c) 2014 Research Organization for Information Science >>>> + * and Technology (RIST). All rights reserved. >>>> * $COPYRIGHT$ >>>> * >>>> * Additional copyrights may follow >>>> @@ -28,6 +30,8 @@ >>>> #include "mpi.h" >>>> #include "ompi/mca/coll/coll.h" >>>> #include "ompi/mca/coll/base/base.h" >>>> +#include "ompi/mca/topo/topo.h" >>>> +#include "ompi/mca/topo/base/base.h" >>>> #include "coll_basic.h" >>>> >>>> >>>> @@ -70,7 +74,36 @@ >>>> } else { >>>> size = ompi_comm_size(comm); >>>> } >>>> - basic_module->mccb_num_reqs = size * 2; >>>> + size *= 2; >>>> + if (OMPI_COMM_IS_CART(comm)) { >>>> + int cart_size; >>>> + mca_topo_base_comm_cart_2_1_0_t *cart; >>>> + assert (NULL != comm->c_topo); >>>> + cart = comm->c_topo->mtc.cart; >>>> + cart_size = cart->ndims * 4; >>>> + if (cart_size > size) { >>>> + size = cart_size; >>>> + } >>>> + } else if (OMPI_COMM_IS_GRAPH(comm)) { >>>> + int rank, degree; >>>> + assert (NULL != comm->c_topo); >>>> + rank = ompi_comm_rank (comm); >>>> + mca_topo_base_graph_neighbors_count (comm, rank, °ree); >>>> + degree *= 2; >>>> + if (degree > size) { >>>> + size = degree; >>>> + } >>>> + } else if (OMPI_COMM_IS_DIST_GRAPH(comm)) { >>>> + int dist_graph_size; >>>> + mca_topo_base_comm_dist_graph_2_1_0_t *dist_graph; >>>> + assert (NULL != comm->c_topo); >>>> + dist_graph = comm->c_topo->mtc.dist_graph; >>>> + dist_graph_size = dist_graph->indegree + dist_graph->outdegree; >>>> + if (dist_graph_size > size) { >>>> + size = dist_graph_size; >>>> + } >>>> + } >>>> + basic_module->mccb_num_reqs = size; >>>> basic_module->mccb_reqs = (ompi_request_t**) >>>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15941.php >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2014/09/15945.php >> > > <mmcb.v3.patch>_______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/10/15957.php -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/