Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
I should have been clearer. This changes how the topo components are expected to work. With this change they now MUST provide the c_topo to the communicator before coll_select. The change is good but may be unexpected for any vendor topo components out there (which I doubt there are any). -Nathan On Wed, Oct 01, 2014 at 12:52:00PM +, Jeff Squyres (jsquyres) wrote: > On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet >wrote: > > > oops, i missed graph and dist graph :-( > > > > the attached patch fixes this > > /* it was straightforward for graph and a bit more challenging for dist > > graph */ > > > > i am fine for Nathan's patch for v1.8 > > /* v1.8 and trunk have already a different way to handle a memory leak */ > > /* the attached patch is vs v1.8 but it is for review only, i will port > > it to the trunk */ > > > > imho, there is no ABI break with this patch. > > out of curiosity, what is the ABI break that is/will be introduced with > > the "real" fix ? > > I haven't looked at the code at all, but Nathan said it changed the topo > framework interface. That's the ABI I was referring to -- not the MPI ABI. > Sorry for the confusions. > > > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm > > test suite in order to test > > graph and dist_graph > > > > Cheers, > > > > Gilles > > > > > > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: > >> On the call today, we decided that Nathan's v1.8 patch is sufficient for > >> the v1.8 series, and the "real" fix will be applied to the trunk and > >> carried forward to the v1.9 series (since it introduces an ABI break for > >> the topo framework, which we try not to do in the middle of a release > >> series). > >> > >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: > >> > >>> Not quite right. There still is no topology information at collective > >>> selection time for either graph or dist graph. > >>> > >>> -Nathan > >>> > >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: > Nathan, > > here is a revision of the previously attached patch, and that supports > graph and dist graph. > > Cheers, > > Gilles > > On 2014/09/30 0:05, Nathan Hjelm wrote: > > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: > > Nathan, > > why not just make the topology information available at that point as > you > described it ? > > the attached patch does this, could you please review it ? > > Cheers, > > Gilles > > On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > > Index: ompi/mca/topo/base/topo_base_cart_create.c >
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
I see no change in the topo interface in any of the patches attached to this thread. Is there any other patch related to this discussion? George. > On Oct 1, 2014, at 14:52, Jeff Squyres (jsquyres)wrote: > >> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet >> wrote: >> >> oops, i missed graph and dist graph :-( >> >> the attached patch fixes this >> /* it was straightforward for graph and a bit more challenging for dist >> graph */ >> >> i am fine for Nathan's patch for v1.8 >> /* v1.8 and trunk have already a different way to handle a memory leak */ >> /* the attached patch is vs v1.8 but it is for review only, i will port >> it to the trunk */ >> >> imho, there is no ABI break with this patch. >> out of curiosity, what is the ABI break that is/will be introduced with >> the "real" fix ? > > I haven't looked at the code at all, but Nathan said it changed the topo > framework interface. That's the ABI I was referring to -- not the MPI ABI. > Sorry for the confusions. > >> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm >> test suite in order to test >> graph and dist_graph >> >> Cheers, >> >> Gilles >> >> >>> On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: >>> On the call today, we decided that Nathan's v1.8 patch is sufficient for >>> the v1.8 series, and the "real" fix will be applied to the trunk and >>> carried forward to the v1.9 series (since it introduces an ABI break for >>> the topo framework, which we try not to do in the middle of a release >>> series). >>> >>> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: >>> Not quite right. There still is no topology information at collective selection time for either graph or dist graph. -Nathan > On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: > Nathan, > > here is a revision of the previously attached patch, and that supports > graph and dist graph. > > Cheers, > > Gilles > > On 2014/09/30 0:05, Nathan Hjelm wrote: > > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: > > Nathan, > > why not just make the topology information available at that point as you > described it ? > > the attached patch does this, could you please review it ? > > Cheers, > > Gilles > > On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > >if (OMPI_COMM_IS_INTER(comm)) { >size = ompi_comm_remote_size(comm); >} else { >size = ompi_comm_size(comm); >} >basic_module->mccb_num_reqs = size * 2; >basic_module->mccb_reqs = (ompi_request_t**) >malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > > Index: ompi/mca/topo/base/topo_base_cart_create.c > === > --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) > +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) > @@ -163,10 +163,18 @@ >return MPI_ERR_INTERN; >} > > +assert(NULL ==
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardetwrote: > oops, i missed graph and dist graph :-( > > the attached patch fixes this > /* it was straightforward for graph and a bit more challenging for dist > graph */ > > i am fine for Nathan's patch for v1.8 > /* v1.8 and trunk have already a different way to handle a memory leak */ > /* the attached patch is vs v1.8 but it is for review only, i will port > it to the trunk */ > > imho, there is no ABI break with this patch. > out of curiosity, what is the ABI break that is/will be introduced with > the "real" fix ? I haven't looked at the code at all, but Nathan said it changed the topo framework interface. That's the ABI I was referring to -- not the MPI ABI. Sorry for the confusions. > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm > test suite in order to test > graph and dist_graph > > Cheers, > > Gilles > > > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: >> On the call today, we decided that Nathan's v1.8 patch is sufficient for the >> v1.8 series, and the "real" fix will be applied to the trunk and carried >> forward to the v1.9 series (since it introduces an ABI break for the topo >> framework, which we try not to do in the middle of a release series). >> >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: >> >>> Not quite right. There still is no topology information at collective >>> selection time for either graph or dist graph. >>> >>> -Nathan >>> >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: Nathan, here is a revision of the previously attached patch, and that supports graph and dist graph. Cheers, Gilles On 2014/09/30 0:05, Nathan Hjelm wrote: An equivalent change would need to be made for graph and dist graph as well. That will take a little more work. Also, I was avoiding changing anything in topo for 1.8. -Nathan On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: Nathan, why not just make the topology information available at that point as you described it ? the attached patch does this, could you please review it ? Cheers, Gilles On 2014/09/26 2:50, Nathan Hjelm wrote: On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: I finally managed to track down some issues in mpi4py's test suite using Open MPI 1.8+. The code below should be enough to reproduce the problem. Run it under valgrind to make sense of my following diagnostics. In this code I'm creating a 2D, periodic Cartesian topology out of COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out links to itself. So we have size=1 but indegree=outdegree=4. However, in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are being allocated to manage communication: if (OMPI_COMM_IS_INTER(comm)) { size = ompi_comm_remote_size(comm); } else { size = ompi_comm_size(comm); } basic_module->mccb_num_reqs = size * 2; basic_module->mccb_reqs = (ompi_request_t**) malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); I guess you have to also special-case for topologies and allocate indegree+outdegree requests (not sure about this number, just guessing). I wish this was possible but the topology information is not available at that point. We may be able to change that but I don't see the work completing anytime soon. I committed an alternative fix as r32796 and CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer produces a SEGV. Let me know if you run into any more issues. -Nathan ___ devel mailing list de...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: http://www.open-mpi.org/community/lists/devel/2014/09/15915.php Index: ompi/mca/topo/base/topo_base_cart_create.c === --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) @@ -163,10 +163,18 @@ return MPI_ERR_INTERN; } +assert(NULL == new_comm->c_topo); +assert(!(new_comm->c_flags & OMPI_COMM_CART)); +new_comm->c_topo = topo; +new_comm->c_topo->mtc.cart = cart; +new_comm->c_topo->reorder = reorder; +new_comm->c_flags |= OMPI_COMM_CART; ret = ompi_comm_enable(old_comm, new_comm, new_rank,
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
oops, i missed graph and dist graph :-( the attached patch fixes this /* it was straightforward for graph and a bit more challenging for dist graph */ i am fine for Nathan's patch for v1.8 /* v1.8 and trunk have already a different way to handle a memory leak */ /* the attached patch is vs v1.8 but it is for review only, i will port it to the trunk */ imho, there is no ABI break with this patch. out of curiosity, what is the ABI break that is/will be introduced with the "real" fix ? FYI, i also enhanced the collective/neighbor_allgather_self from the ibm test suite in order to test graph and dist_graph Cheers, Gilles On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: > On the call today, we decided that Nathan's v1.8 patch is sufficient for the > v1.8 series, and the "real" fix will be applied to the trunk and carried > forward to the v1.9 series (since it introduces an ABI break for the topo > framework, which we try not to do in the middle of a release series). > > On Sep 30, 2014, at 10:40 AM, Nathan Hjelmwrote: > >> Not quite right. There still is no topology information at collective >> selection time for either graph or dist graph. >> >> -Nathan >> >> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: >>> Nathan, >>> >>> here is a revision of the previously attached patch, and that supports >>> graph and dist graph. >>> >>> Cheers, >>> >>> Gilles >>> >>> On 2014/09/30 0:05, Nathan Hjelm wrote: >>> >>> An equivalent change would need to be made for graph and dist graph as >>> well. That will take a little more work. Also, I was avoiding changing >>> anything in topo for 1.8. >>> >>> -Nathan >>> >>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >>> >>>Nathan, >>> >>>why not just make the topology information available at that point as you >>>described it ? >>> >>>the attached patch does this, could you please review it ? >>> >>>Cheers, >>> >>>Gilles >>> >>>On 2014/09/26 2:50, Nathan Hjelm wrote: >>> >>> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >>> >>> I finally managed to track down some issues in mpi4py's test suite >>> using Open MPI 1.8+. The code below should be enough to reproduce the >>> problem. Run it under valgrind to make sense of my following >>> diagnostics. >>> >>> In this code I'm creating a 2D, periodic Cartesian topology out of >>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >>> links to itself. So we have size=1 but indegree=outdegree=4. However, >>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >>> being allocated to manage communication: >>> >>> if (OMPI_COMM_IS_INTER(comm)) { >>> size = ompi_comm_remote_size(comm); >>> } else { >>> size = ompi_comm_size(comm); >>> } >>> basic_module->mccb_num_reqs = size * 2; >>> basic_module->mccb_reqs = (ompi_request_t**) >>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >>> >>> I guess you have to also special-case for topologies and allocate >>> indegree+outdegree requests (not sure about this number, just >>> guessing). >>> >>> >>> I wish this was possible but the topology information is not available >>> at that point. We may be able to change that but I don't see the work >>> completing anytime soon. I committed an alternative fix as r32796 and >>> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer >>> produces a SEGV. Let me know if you run into any more issues. >>> >>> >>> -Nathan >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php >>> >>> Index: ompi/mca/topo/base/topo_base_cart_create.c >>> === >>> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) >>> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) >>> @@ -163,10 +163,18 @@ >>> return MPI_ERR_INTERN; >>> } >>> >>> +assert(NULL == new_comm->c_topo); >>> +assert(!(new_comm->c_flags & OMPI_COMM_CART)); >>> +new_comm->c_topo = topo; >>> +new_comm->c_topo->mtc.cart = cart; >>> +new_comm->c_topo->reorder = reorder; >>> +new_comm->c_flags |= OMPI_COMM_CART; >>> ret = ompi_comm_enable(old_comm, new_comm, >>> new_rank, num_procs, topo_procs); >>> if (OMPI_SUCCESS != ret) { >>> /* something wrong happened during setting the communicator */ >>> +new_comm->c_topo = NULL; >>> +new_comm->c_flags &= ~OMPI_COMM_CART; >>> ompi_comm_free (_comm); >>> free(topo_procs); >>> if(NULL != cart->periods) free(cart->periods); >>> @@ -176,10 +184,6 @@ >>> return ret;
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Not quite right. There still is no topology information at collective selection time for either graph or dist graph. -Nathan On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: >Nathan, > >here is a revision of the previously attached patch, and that supports >graph and dist graph. > >Cheers, > >Gilles > >On 2014/09/30 0:05, Nathan Hjelm wrote: > > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: > > Nathan, > > why not just make the topology information available at that point as you > described it ? > > the attached patch does this, could you please review it ? > > Cheers, > > Gilles > > On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > > Index: ompi/mca/topo/base/topo_base_cart_create.c > === > --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) > +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) > @@ -163,10 +163,18 @@ > return MPI_ERR_INTERN; > } > > +assert(NULL == new_comm->c_topo); > +assert(!(new_comm->c_flags & OMPI_COMM_CART)); > +new_comm->c_topo = topo; > +new_comm->c_topo->mtc.cart = cart; > +new_comm->c_topo->reorder = reorder; > +new_comm->c_flags |= OMPI_COMM_CART; > ret = ompi_comm_enable(old_comm, new_comm, > new_rank, num_procs, topo_procs); > if (OMPI_SUCCESS != ret) { > /* something wrong happened during setting the communicator */ > +new_comm->c_topo = NULL; > +new_comm->c_flags &= ~OMPI_COMM_CART; > ompi_comm_free (_comm); > free(topo_procs); > if(NULL != cart->periods) free(cart->periods); > @@ -176,10 +184,6 @@ > return ret; > } > > -new_comm->c_topo = topo; > -new_comm->c_topo->mtc.cart = cart; > -new_comm->c_topo->reorder = reorder; > -new_comm->c_flags |= OMPI_COMM_CART; > *comm_topo = new_comm; > > if( MPI_UNDEFINED == new_rank ) { > Index: ompi/mca/coll/basic/coll_basic_module.c > === > --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) > +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) > @@ -13,6 +13,8 @@ >* Copyright (c) 2012 Sandia National Laboratories. All rights > reserved. >* Copyright (c) 2013 Los Alamos National Security, LLC. All rights >* reserved. > + * Copyright (c) 2014 Research Organization for Information Science > + * and Technology (RIST). All rights reserved. >* $COPYRIGHT$ >* >* Additional copyrights may follow > @@ -28,6 +30,7 @@ > #include "mpi.h" > #include "ompi/mca/coll/coll.h" > #include "ompi/mca/coll/base/base.h" > +#include "ompi/mca/topo/topo.h" > #include "coll_basic.h" > > > @@ -70,6
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Nathan, here is a revision of the previously attached patch, and that supports graph and dist graph. Cheers, Gilles On 2014/09/30 0:05, Nathan Hjelm wrote: > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >>Nathan, >> >>why not just make the topology information available at that point as you >>described it ? >> >>the attached patch does this, could you please review it ? >> >>Cheers, >> >>Gilles >> >>On 2014/09/26 2:50, Nathan Hjelm wrote: >> >> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> >> >> I wish this was possible but the topology information is not available >> at that point. We may be able to change that but I don't see the work >> completing anytime soon. I committed an alternative fix as r32796 and >> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer >> produces a SEGV. Let me know if you run into any more issues. >> >> >> -Nathan >> >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php >> Index: ompi/mca/topo/base/topo_base_cart_create.c >> === >> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) >> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) >> @@ -163,10 +163,18 @@ >> return MPI_ERR_INTERN; >> } >> >> +assert(NULL == new_comm->c_topo); >> +assert(!(new_comm->c_flags & OMPI_COMM_CART)); >> +new_comm->c_topo = topo; >> +new_comm->c_topo->mtc.cart = cart; >> +new_comm->c_topo->reorder = reorder; >> +new_comm->c_flags |= OMPI_COMM_CART; >> ret = ompi_comm_enable(old_comm, new_comm, >> new_rank, num_procs, topo_procs); >> if (OMPI_SUCCESS != ret) { >> /* something wrong happened during setting the communicator */ >> +new_comm->c_topo = NULL; >> +new_comm->c_flags &= ~OMPI_COMM_CART; >> ompi_comm_free (_comm); >> free(topo_procs); >> if(NULL != cart->periods) free(cart->periods); >> @@ -176,10 +184,6 @@ >> return ret; >> } >> >> -new_comm->c_topo = topo; >> -new_comm->c_topo->mtc.cart = cart; >> -new_comm->c_topo->reorder = reorder; >> -new_comm->c_flags |= OMPI_COMM_CART; >> *comm_topo = new_comm; >> >> if( MPI_UNDEFINED == new_rank ) { >> Index: ompi/mca/coll/basic/coll_basic_module.c >> === >> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) >> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) >> @@ -13,6 +13,8 @@ >> * Copyright (c) 2012 Sandia National Laboratories. All rights >> reserved. >> * Copyright (c) 2013 Los Alamos National Security, LLC. All rights >> * reserved. >> + * Copyright (c) 2014 Research Organization for Information Science >> + * and Technology (RIST). All rights reserved. >> * $COPYRIGHT$ >> * >> * Additional copyrights may follow >> @@ -28,6 +30,7 @@ >> #include "mpi.h" >> #include "ompi/mca/coll/coll.h" >> #include "ompi/mca/coll/base/base.h" >> +#include "ompi/mca/topo/topo.h" >> #include "coll_basic.h" >> >> >> @@ -70,6 +73,15 @@ >> } else { >> size = ompi_comm_size(comm); >> } >> +if (comm->c_flags & OMPI_COMM_CART) { >> +int cart_size; >> +assert (NULL != comm->c_topo); >> +
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
An equivalent change would need to be made for graph and dist graph as well. That will take a little more work. Also, I was avoiding changing anything in topo for 1.8. -Nathan On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >Nathan, > >why not just make the topology information available at that point as you >described it ? > >the attached patch does this, could you please review it ? > >Cheers, > >Gilles > >On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > Index: ompi/mca/topo/base/topo_base_cart_create.c > === > --- ompi/mca/topo/base/topo_base_cart_create.c(revision 32807) > +++ ompi/mca/topo/base/topo_base_cart_create.c(working copy) > @@ -163,10 +163,18 @@ > return MPI_ERR_INTERN; > } > > +assert(NULL == new_comm->c_topo); > +assert(!(new_comm->c_flags & OMPI_COMM_CART)); > +new_comm->c_topo = topo; > +new_comm->c_topo->mtc.cart = cart; > +new_comm->c_topo->reorder = reorder; > +new_comm->c_flags |= OMPI_COMM_CART; > ret = ompi_comm_enable(old_comm, new_comm, > new_rank, num_procs, topo_procs); > if (OMPI_SUCCESS != ret) { > /* something wrong happened during setting the communicator */ > +new_comm->c_topo = NULL; > +new_comm->c_flags &= ~OMPI_COMM_CART; > ompi_comm_free (_comm); > free(topo_procs); > if(NULL != cart->periods) free(cart->periods); > @@ -176,10 +184,6 @@ > return ret; > } > > -new_comm->c_topo = topo; > -new_comm->c_topo->mtc.cart = cart; > -new_comm->c_topo->reorder = reorder; > -new_comm->c_flags |= OMPI_COMM_CART; > *comm_topo = new_comm; > > if( MPI_UNDEFINED == new_rank ) { > Index: ompi/mca/coll/basic/coll_basic_module.c > === > --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) > +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) > @@ -13,6 +13,8 @@ > * Copyright (c) 2012 Sandia National Laboratories. All rights reserved. > * Copyright (c) 2013 Los Alamos National Security, LLC. All rights > * reserved. > + * Copyright (c) 2014 Research Organization for Information Science > + * and Technology (RIST). All rights reserved. > * $COPYRIGHT$ > * > * Additional copyrights may follow > @@ -28,6 +30,7 @@ > #include "mpi.h" > #include "ompi/mca/coll/coll.h" > #include "ompi/mca/coll/base/base.h" > +#include "ompi/mca/topo/topo.h" > #include "coll_basic.h" > > > @@ -70,6 +73,15 @@ > } else { > size = ompi_comm_size(comm); > } > +if (comm->c_flags & OMPI_COMM_CART) { > +int cart_size; > +assert (NULL != comm->c_topo); > +comm->c_topo->topo.cart.cartdim_get(comm, _size); > +cart_size *= 2; > +if (cart_size > size) { > +size = cart_size; > +} > +} > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) *
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Nathan, why not just make the topology information available at that point as you described it ? the attached patch does this, could you please review it ? Cheers, Gilles On 2014/09/26 2:50, Nathan Hjelm wrote: > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php Index: ompi/mca/topo/base/topo_base_cart_create.c === --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) @@ -163,10 +163,18 @@ return MPI_ERR_INTERN; } +assert(NULL == new_comm->c_topo); +assert(!(new_comm->c_flags & OMPI_COMM_CART)); +new_comm->c_topo = topo; +new_comm->c_topo->mtc.cart = cart; +new_comm->c_topo->reorder = reorder; +new_comm->c_flags |= OMPI_COMM_CART; ret = ompi_comm_enable(old_comm, new_comm, new_rank, num_procs, topo_procs); if (OMPI_SUCCESS != ret) { /* something wrong happened during setting the communicator */ +new_comm->c_topo = NULL; +new_comm->c_flags &= ~OMPI_COMM_CART; ompi_comm_free (_comm); free(topo_procs); if(NULL != cart->periods) free(cart->periods); @@ -176,10 +184,6 @@ return ret; } -new_comm->c_topo = topo; -new_comm->c_topo->mtc.cart = cart; -new_comm->c_topo->reorder = reorder; -new_comm->c_flags |= OMPI_COMM_CART; *comm_topo = new_comm; if( MPI_UNDEFINED == new_rank ) { Index: ompi/mca/coll/basic/coll_basic_module.c === --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) @@ -13,6 +13,8 @@ * Copyright (c) 2012 Sandia National Laboratories. All rights reserved. * Copyright (c) 2013 Los Alamos National Security, LLC. All rights * reserved. + * Copyright (c) 2014 Research Organization for Information Science + * and Technology (RIST). All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -28,6 +30,7 @@ #include "mpi.h" #include "ompi/mca/coll/coll.h" #include "ompi/mca/coll/base/base.h" +#include "ompi/mca/topo/topo.h" #include "coll_basic.h" @@ -70,6 +73,15 @@ } else { size = ompi_comm_size(comm); } +if (comm->c_flags & OMPI_COMM_CART) { +int cart_size; +assert (NULL != comm->c_topo); +comm->c_topo->topo.cart.cartdim_get(comm, _size); +cart_size *= 2; +if (cart_size > size) { +size = cart_size; +} +} basic_module->mccb_num_reqs = size * 2; basic_module->mccb_reqs = (ompi_request_t**) malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
On 25 September 2014 20:50, Nathan Hjelmwrote: > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > Did your fix get in for 1.8.3? I'm still getting the segfault. -- Lisandro Dalcin Research Scientist Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Numerical Porous Media Center (NumPor) King Abdullah University of Science and Technology (KAUST) http://numpor.kaust.edu.sa/ 4700 King Abdullah University of Science and Technology al-Khawarizmi Bldg (Bldg 1), Office # 4332 Thuwal 23955-6900, Kingdom of Saudi Arabia http://www.kaust.edu.sa Office Phone: +966 12 808-0459
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Good catch. I will take a look and see how best to fix this. -Nathan On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > #include > #include > > int main(int argc, char *argv[]) > { > MPI_Comm comm; > int ndims = 2, dims[2] = {1,1}, periods[2] = {1,1}; > int sendbuf = 7, recvbuf[5] = {0,0,0,0,0}; > MPI_Init(, ); > > MPI_Cart_create(MPI_COMM_SELF, ndims, dims, periods, 0, ); > > MPI_Neighbor_allgather(, 1, MPI_INT, > recvbuf, 1, MPI_INT, > comm); > > {int i; for (i=0;i<5;i++) printf("%d ",recvbuf[i]); printf("\n");} > > MPI_Finalize(); > return 0; > } > > > -- > Lisandro Dalcin > > Research Scientist > Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) > Numerical Porous Media Center (NumPor) > King Abdullah University of Science and Technology (KAUST) > http://numpor.kaust.edu.sa/ > > 4700 King Abdullah University of Science and Technology > al-Khawarizmi Bldg (Bldg 1), Office # 4332 > Thuwal 23955-6900, Kingdom of Saudi Arabia > http://www.kaust.edu.sa > > Office Phone: +966 12 808-0459 > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15713.php pgp4ELRAwv940.pgp Description: PGP signature