I should have been clearer. This changes how the topo components are
expected to work. With this change they now MUST provide the c_topo to
the communicator before coll_select. The change is good but may be
unexpected for any vendor topo components out there (which I doubt
there are any).

-Nathan

On Wed, Oct 01, 2014 at 12:52:00PM +0000, Jeff Squyres (jsquyres) wrote:
> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
> > oops, i missed graph and dist graph :-(
> > 
> > the attached patch fixes this
> > /* it was straightforward for graph and a bit more challenging for dist
> > graph */
> > 
> > i am fine for Nathan's patch for v1.8
> > /* v1.8 and trunk have already a different way to handle a memory leak */
> > /* the attached patch is vs v1.8 but it is for review only, i will port
> > it to the trunk */
> > 
> > imho, there is no ABI break with this patch.
> > out of curiosity, what is the ABI break that is/will be introduced with
> > the "real" fix ?
> 
> I haven't looked at the code at all, but Nathan said it changed the topo 
> framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
> Sorry for the confusions.
> 
> > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
> > test suite in order to test
> > graph and dist_graph
> > 
> > Cheers,
> > 
> > Gilles
> > 
> > 
> > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
> >> On the call today, we decided that Nathan's v1.8 patch is sufficient for 
> >> the v1.8 series, and the "real" fix will be applied to the trunk and 
> >> carried forward to the v1.9 series (since it introduces an ABI break for 
> >> the topo framework, which we try not to do in the middle of a release 
> >> series).
> >> 
> >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >> 
> >>> Not quite right. There still is no topology information at collective
> >>> selection time for either graph or dist graph.
> >>> 
> >>> -Nathan
> >>> 
> >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
> >>>>  Nathan,
> >>>> 
> >>>>  here is a revision of the previously attached patch, and that supports
> >>>>  graph and dist graph.
> >>>> 
> >>>>  Cheers,
> >>>> 
> >>>>  Gilles
> >>>> 
> >>>>  On 2014/09/30 0:05, Nathan Hjelm wrote:
> >>>> 
> >>>> An equivalent change would need to be made for graph and dist graph as
> >>>> well. That will take a little more work. Also, I was avoiding changing
> >>>> anything in topo for 1.8.
> >>>> 
> >>>> -Nathan
> >>>> 
> >>>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
> >>>> 
> >>>>   Nathan,
> >>>> 
> >>>>   why not just make the topology information available at that point as 
> >>>> you
> >>>>   described it ?
> >>>> 
> >>>>   the attached patch does this, could you please review it ?
> >>>> 
> >>>>   Cheers,
> >>>> 
> >>>>   Gilles
> >>>> 
> >>>>   On 2014/09/26 2:50, Nathan Hjelm wrote:
> >>>> 
> >>>> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> >>>> 
> >>>> I finally managed to track down some issues in mpi4py's test suite
> >>>> using Open MPI 1.8+. The code below should be enough to reproduce the
> >>>> problem. Run it under valgrind to make sense of my following
> >>>> diagnostics.
> >>>> 
> >>>> In this code I'm creating a 2D, periodic Cartesian topology out of
> >>>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
> >>>> links to itself. So we have size=1 but indegree=outdegree=4. However,
> >>>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
> >>>> being allocated to manage communication:
> >>>> 
> >>>>     if (OMPI_COMM_IS_INTER(comm)) {
> >>>>         size = ompi_comm_remote_size(comm);
> >>>>     } else {
> >>>>         size = ompi_comm_size(comm);
> >>>>     }
> >>>>     basic_module->mccb_num_reqs = size * 2;
> >>>>     basic_module->mccb_reqs = (ompi_request_t**)
> >>>>         malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> >>>> 
> >>>> I guess you have to also special-case for topologies and allocate
> >>>> indegree+outdegree requests (not sure about this number, just
> >>>> guessing).
> >>>> 
> >>>> 
> >>>> I wish this was possible but the topology information is not available
> >>>> at that point. We may be able to change that but I don't see the work
> >>>> completing anytime soon. I committed an alternative fix as r32796 and
> >>>> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
> >>>> produces a SEGV. Let me know if you run into any more issues.
> >>>> 
> >>>> 
> >>>> -Nathan
> >>>> 
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post: 
> >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
> >>>> 
> >>>> Index: ompi/mca/topo/base/topo_base_cart_create.c
> >>>> ===================================================================
> >>>> --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
> >>>> +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
> >>>> @@ -163,10 +163,18 @@
> >>>>         return MPI_ERR_INTERN;
> >>>>     }
> >>>> 
> >>>> +    assert(NULL == new_comm->c_topo);
> >>>> +    assert(!(new_comm->c_flags & OMPI_COMM_CART));
> >>>> +    new_comm->c_topo           = topo;
> >>>> +    new_comm->c_topo->mtc.cart = cart;
> >>>> +    new_comm->c_topo->reorder  = reorder;
> >>>> +    new_comm->c_flags         |= OMPI_COMM_CART;
> >>>>     ret = ompi_comm_enable(old_comm, new_comm,
> >>>>                            new_rank, num_procs, topo_procs);
> >>>>     if (OMPI_SUCCESS != ret) {
> >>>>         /* something wrong happened during setting the communicator */
> >>>> +        new_comm->c_topo = NULL;
> >>>> +        new_comm->c_flags &= ~OMPI_COMM_CART;
> >>>>         ompi_comm_free (&new_comm);
> >>>>         free(topo_procs);
> >>>>         if(NULL != cart->periods) free(cart->periods);
> >>>> @@ -176,10 +184,6 @@
> >>>>         return ret;
> >>>>     }
> >>>> 
> >>>> -    new_comm->c_topo           = topo;
> >>>> -    new_comm->c_topo->mtc.cart = cart;
> >>>> -    new_comm->c_topo->reorder  = reorder;
> >>>> -    new_comm->c_flags         |= OMPI_COMM_CART;
> >>>>     *comm_topo = new_comm;
> >>>> 
> >>>>     if( MPI_UNDEFINED == new_rank ) {
> >>>> Index: ompi/mca/coll/basic/coll_basic_module.c
> >>>> ===================================================================
> >>>> --- ompi/mca/coll/basic/coll_basic_module.c     (revision 32807)
> >>>> +++ ompi/mca/coll/basic/coll_basic_module.c     (working copy)
> >>>> @@ -13,6 +13,8 @@
> >>>>  * Copyright (c) 2012      Sandia National Laboratories. All rights 
> >>>> reserved.
> >>>>  * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
> >>>>  *                         reserved.
> >>>> + * Copyright (c) 2014      Research Organization for Information Science
> >>>> + *                         and Technology (RIST). All rights reserved.
> >>>>  * $COPYRIGHT$
> >>>>  *
> >>>>  * Additional copyrights may follow
> >>>> @@ -28,6 +30,7 @@
> >>>> #include "mpi.h"
> >>>> #include "ompi/mca/coll/coll.h"
> >>>> #include "ompi/mca/coll/base/base.h"
> >>>> +#include "ompi/mca/topo/topo.h"
> >>>> #include "coll_basic.h"
> >>>> 
> >>>> 
> >>>> @@ -70,6 +73,15 @@
> >>>>     } else {
> >>>>         size = ompi_comm_size(comm);
> >>>>     }
> >>>> +    if (comm->c_flags & OMPI_COMM_CART) {
> >>>> +        int cart_size;
> >>>> +        assert (NULL != comm->c_topo);
> >>>> +        comm->c_topo->topo.cart.cartdim_get(comm, &cart_size);
> >>>> +        cart_size *= 2;
> >>>> +        if (cart_size > size) {
> >>>> +            size = cart_size;
> >>>> +        }
> >>>> +    }
> >>>>     basic_module->mccb_num_reqs = size * 2;
> >>>>     basic_module->mccb_reqs = (ompi_request_t**)
> >>>>         malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> >>>> 
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post: 
> >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15929.php
> >>>> 
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post: 
> >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15930.php
> >>>> Index: ompi/mca/topo/base/topo_base_cart_create.c
> >>>> ===================================================================
> >>>> --- ompi/mca/topo/base/topo_base_cart_create.c   (revision 32815)
> >>>> +++ ompi/mca/topo/base/topo_base_cart_create.c   (working copy)
> >>>> @@ -163,10 +163,18 @@
> >>>>        return MPI_ERR_INTERN;
> >>>>    }
> >>>> 
> >>>> +    assert(NULL == new_comm->c_topo);
> >>>> +    assert(!(new_comm->c_flags & OMPI_COMM_CART));
> >>>> +    new_comm->c_topo           = topo;
> >>>> +    new_comm->c_topo->mtc.cart = cart;
> >>>> +    new_comm->c_topo->reorder  = reorder;
> >>>> +    new_comm->c_flags         |= OMPI_COMM_CART;
> >>>>    ret = ompi_comm_enable(old_comm, new_comm,
> >>>>                           new_rank, num_procs, topo_procs);
> >>>>    if (OMPI_SUCCESS != ret) {
> >>>>        /* something wrong happened during setting the communicator */
> >>>> +        new_comm->c_topo = NULL;
> >>>> +        new_comm->c_flags &= ~OMPI_COMM_CART;
> >>>>        ompi_comm_free (&new_comm);
> >>>>        free(topo_procs);
> >>>>        if(NULL != cart->periods) free(cart->periods);
> >>>> @@ -176,10 +184,6 @@
> >>>>        return ret;
> >>>>    }
> >>>> 
> >>>> -    new_comm->c_topo           = topo;
> >>>> -    new_comm->c_topo->mtc.cart = cart;
> >>>> -    new_comm->c_topo->reorder  = reorder;
> >>>> -    new_comm->c_flags         |= OMPI_COMM_CART;
> >>>>    *comm_topo = new_comm;
> >>>> 
> >>>>    if( MPI_UNDEFINED == new_rank ) {
> >>>> Index: ompi/mca/coll/basic/coll_basic_module.c
> >>>> ===================================================================
> >>>> --- ompi/mca/coll/basic/coll_basic_module.c      (revision 32815)
> >>>> +++ ompi/mca/coll/basic/coll_basic_module.c      (working copy)
> >>>> @@ -13,6 +13,8 @@
> >>>> * Copyright (c) 2012      Sandia National Laboratories. All rights 
> >>>> reserved.
> >>>> * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
> >>>> *                         reserved.
> >>>> + * Copyright (c) 2014      Research Organization for Information Science
> >>>> + *                         and Technology (RIST). All rights reserved.
> >>>> * $COPYRIGHT$
> >>>> * 
> >>>> * Additional copyrights may follow
> >>>> @@ -28,6 +30,8 @@
> >>>> #include "mpi.h"
> >>>> #include "ompi/mca/coll/coll.h"
> >>>> #include "ompi/mca/coll/base/base.h"
> >>>> +#include "ompi/mca/topo/topo.h"
> >>>> +#include "ompi/mca/topo/base/base.h"
> >>>> #include "coll_basic.h"
> >>>> 
> >>>> 
> >>>> @@ -70,7 +74,36 @@
> >>>>    } else {
> >>>>        size = ompi_comm_size(comm);
> >>>>    }
> >>>> -    basic_module->mccb_num_reqs = size * 2;
> >>>> +    size *= 2;
> >>>> +    if (OMPI_COMM_IS_CART(comm)) {
> >>>> +        int cart_size;
> >>>> +        mca_topo_base_comm_cart_2_1_0_t *cart;
> >>>> +        assert (NULL != comm->c_topo);
> >>>> +        cart = comm->c_topo->mtc.cart;
> >>>> +        cart_size = cart->ndims * 4;
> >>>> +        if (cart_size > size) {
> >>>> +            size = cart_size;
> >>>> +        }
> >>>> +    } else if (OMPI_COMM_IS_GRAPH(comm)) {
> >>>> +        int rank, degree;
> >>>> +        assert (NULL != comm->c_topo);
> >>>> +        rank = ompi_comm_rank (comm);
> >>>> +        mca_topo_base_graph_neighbors_count (comm, rank, &degree);
> >>>> +        degree *= 2;
> >>>> +        if (degree > size) {
> >>>> +            size = degree;
> >>>> +        }
> >>>> +    } else if (OMPI_COMM_IS_DIST_GRAPH(comm)) {
> >>>> +        int dist_graph_size;
> >>>> +        mca_topo_base_comm_dist_graph_2_1_0_t *dist_graph;
> >>>> +        assert (NULL != comm->c_topo);
> >>>> +        dist_graph = comm->c_topo->mtc.dist_graph;
> >>>> +        dist_graph_size = dist_graph->indegree + dist_graph->outdegree;
> >>>> +        if (dist_graph_size > size) {
> >>>> +            size = dist_graph_size;
> >>>> +        }
> >>>> +    }
> >>>> +    basic_module->mccb_num_reqs = size;
> >>>>    basic_module->mccb_reqs = (ompi_request_t**) 
> >>>>        malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> >>>> 
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post: 
> >>>> http://www.open-mpi.org/community/lists/devel/2014/09/15941.php
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post: 
> >>> http://www.open-mpi.org/community/lists/devel/2014/09/15945.php
> >> 
> > 
> > <mmcb.v3.patch>_______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/10/15957.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/10/15959.php

Attachment: pgpbd0gzWPE3S.pgp
Description: PGP signature

Reply via email to