Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
I should have been clearer. This changes how the topo components are expected to work. With this change they now MUST provide the c_topo to the communicator before coll_select. The change is good but may be unexpected for any vendor topo components out there (which I doubt there are any). -Nathan On Wed, Oct 01, 2014 at 12:52:00PM +, Jeff Squyres (jsquyres) wrote: > On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet > wrote: > > > oops, i missed graph and dist graph :-( > > > > the attached patch fixes this > > /* it was straightforward for graph and a bit more challenging for dist > > graph */ > > > > i am fine for Nathan's patch for v1.8 > > /* v1.8 and trunk have already a different way to handle a memory leak */ > > /* the attached patch is vs v1.8 but it is for review only, i will port > > it to the trunk */ > > > > imho, there is no ABI break with this patch. > > out of curiosity, what is the ABI break that is/will be introduced with > > the "real" fix ? > > I haven't looked at the code at all, but Nathan said it changed the topo > framework interface. That's the ABI I was referring to -- not the MPI ABI. > Sorry for the confusions. > > > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm > > test suite in order to test > > graph and dist_graph > > > > Cheers, > > > > Gilles > > > > > > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: > >> On the call today, we decided that Nathan's v1.8 patch is sufficient for > >> the v1.8 series, and the "real" fix will be applied to the trunk and > >> carried forward to the v1.9 series (since it introduces an ABI break for > >> the topo framework, which we try not to do in the middle of a release > >> series). > >> > >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: > >> > >>> Not quite right. There still is no topology information at collective > >>> selection time for either graph or dist graph. > >>> > >>> -Nathan > >>> > >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: > Nathan, > > here is a revision of the previously attached patch, and that supports > graph and dist graph. > > Cheers, > > Gilles > > On 2014/09/30 0:05, Nathan Hjelm wrote: > > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: > > Nathan, > > why not just make the topology information available at that point as > you > described it ? > > the attached patch does this, could you please review it ? > > Cheers, > > Gilles > > On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > > Index: ompi/mca/topo/base/topo_base_cart_create.c > === > --- ompi/mca/topo/base
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Nathan? On Oct 1, 2014, at 10:27 AM, George Bosilca wrote: > I see no change in the topo interface in any of the patches attached to this > thread. Is there any other patch related to this discussion? > > George. > > > >> On Oct 1, 2014, at 14:52, Jeff Squyres (jsquyres) wrote: >> >>> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet >>> wrote: >>> >>> oops, i missed graph and dist graph :-( >>> >>> the attached patch fixes this >>> /* it was straightforward for graph and a bit more challenging for dist >>> graph */ >>> >>> i am fine for Nathan's patch for v1.8 >>> /* v1.8 and trunk have already a different way to handle a memory leak */ >>> /* the attached patch is vs v1.8 but it is for review only, i will port >>> it to the trunk */ >>> >>> imho, there is no ABI break with this patch. >>> out of curiosity, what is the ABI break that is/will be introduced with >>> the "real" fix ? >> >> I haven't looked at the code at all, but Nathan said it changed the topo >> framework interface. That's the ABI I was referring to -- not the MPI ABI. >> Sorry for the confusions. >> >>> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm >>> test suite in order to test >>> graph and dist_graph >>> >>> Cheers, >>> >>> Gilles >>> >>> On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: On the call today, we decided that Nathan's v1.8 patch is sufficient for the v1.8 series, and the "real" fix will be applied to the trunk and carried forward to the v1.9 series (since it introduces an ABI break for the topo framework, which we try not to do in the middle of a release series). On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: > Not quite right. There still is no topology information at collective > selection time for either graph or dist graph. > > -Nathan > >> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: >> Nathan, >> >> here is a revision of the previously attached patch, and that supports >> graph and dist graph. >> >> Cheers, >> >> Gilles >> >> On 2014/09/30 0:05, Nathan Hjelm wrote: >> >> An equivalent change would need to be made for graph and dist graph as >> well. That will take a little more work. Also, I was avoiding changing >> anything in topo for 1.8. >> >> -Nathan >> >> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >> >> Nathan, >> >> why not just make the topology information available at that point as you >> described it ? >> >> the attached patch does this, could you please review it ? >> >> Cheers, >> >> Gilles >> >> On 2014/09/26 2:50, Nathan Hjelm wrote: >> >> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> >> >> I wish this was possible but the topology information is not available >> at that point. We may be able to change that but I don't see the work >> completing anytime soon. I committed an alternative fix as r32796 and >> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer >> produces a SEGV. Let me know if you run into any more issues. >> >> >> -Nathan >> >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php >> >> Index: ompi/mca/topo/base/topo_base_cart_create.c >> === >> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) >> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
I see no change in the topo interface in any of the patches attached to this thread. Is there any other patch related to this discussion? George. > On Oct 1, 2014, at 14:52, Jeff Squyres (jsquyres) wrote: > >> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet >> wrote: >> >> oops, i missed graph and dist graph :-( >> >> the attached patch fixes this >> /* it was straightforward for graph and a bit more challenging for dist >> graph */ >> >> i am fine for Nathan's patch for v1.8 >> /* v1.8 and trunk have already a different way to handle a memory leak */ >> /* the attached patch is vs v1.8 but it is for review only, i will port >> it to the trunk */ >> >> imho, there is no ABI break with this patch. >> out of curiosity, what is the ABI break that is/will be introduced with >> the "real" fix ? > > I haven't looked at the code at all, but Nathan said it changed the topo > framework interface. That's the ABI I was referring to -- not the MPI ABI. > Sorry for the confusions. > >> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm >> test suite in order to test >> graph and dist_graph >> >> Cheers, >> >> Gilles >> >> >>> On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: >>> On the call today, we decided that Nathan's v1.8 patch is sufficient for >>> the v1.8 series, and the "real" fix will be applied to the trunk and >>> carried forward to the v1.9 series (since it introduces an ABI break for >>> the topo framework, which we try not to do in the middle of a release >>> series). >>> >>> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: >>> Not quite right. There still is no topology information at collective selection time for either graph or dist graph. -Nathan > On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: > Nathan, > > here is a revision of the previously attached patch, and that supports > graph and dist graph. > > Cheers, > > Gilles > > On 2014/09/30 0:05, Nathan Hjelm wrote: > > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: > > Nathan, > > why not just make the topology information available at that point as you > described it ? > > the attached patch does this, could you please review it ? > > Cheers, > > Gilles > > On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > >if (OMPI_COMM_IS_INTER(comm)) { >size = ompi_comm_remote_size(comm); >} else { >size = ompi_comm_size(comm); >} >basic_module->mccb_num_reqs = size * 2; >basic_module->mccb_reqs = (ompi_request_t**) >malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > > Index: ompi/mca/topo/base/topo_base_cart_create.c > === > --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) > +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) > @@ -163,10 +163,18 @@ >return MPI_ERR_INTERN; >} > > +assert(NULL == new_comm->c_topo); > +assert(!(new_comm->c_flags & OMPI_COMM_CART));
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet wrote: > oops, i missed graph and dist graph :-( > > the attached patch fixes this > /* it was straightforward for graph and a bit more challenging for dist > graph */ > > i am fine for Nathan's patch for v1.8 > /* v1.8 and trunk have already a different way to handle a memory leak */ > /* the attached patch is vs v1.8 but it is for review only, i will port > it to the trunk */ > > imho, there is no ABI break with this patch. > out of curiosity, what is the ABI break that is/will be introduced with > the "real" fix ? I haven't looked at the code at all, but Nathan said it changed the topo framework interface. That's the ABI I was referring to -- not the MPI ABI. Sorry for the confusions. > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm > test suite in order to test > graph and dist_graph > > Cheers, > > Gilles > > > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: >> On the call today, we decided that Nathan's v1.8 patch is sufficient for the >> v1.8 series, and the "real" fix will be applied to the trunk and carried >> forward to the v1.9 series (since it introduces an ABI break for the topo >> framework, which we try not to do in the middle of a release series). >> >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: >> >>> Not quite right. There still is no topology information at collective >>> selection time for either graph or dist graph. >>> >>> -Nathan >>> >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: Nathan, here is a revision of the previously attached patch, and that supports graph and dist graph. Cheers, Gilles On 2014/09/30 0:05, Nathan Hjelm wrote: An equivalent change would need to be made for graph and dist graph as well. That will take a little more work. Also, I was avoiding changing anything in topo for 1.8. -Nathan On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: Nathan, why not just make the topology information available at that point as you described it ? the attached patch does this, could you please review it ? Cheers, Gilles On 2014/09/26 2:50, Nathan Hjelm wrote: On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: I finally managed to track down some issues in mpi4py's test suite using Open MPI 1.8+. The code below should be enough to reproduce the problem. Run it under valgrind to make sense of my following diagnostics. In this code I'm creating a 2D, periodic Cartesian topology out of COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out links to itself. So we have size=1 but indegree=outdegree=4. However, in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are being allocated to manage communication: if (OMPI_COMM_IS_INTER(comm)) { size = ompi_comm_remote_size(comm); } else { size = ompi_comm_size(comm); } basic_module->mccb_num_reqs = size * 2; basic_module->mccb_reqs = (ompi_request_t**) malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); I guess you have to also special-case for topologies and allocate indegree+outdegree requests (not sure about this number, just guessing). I wish this was possible but the topology information is not available at that point. We may be able to change that but I don't see the work completing anytime soon. I committed an alternative fix as r32796 and CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer produces a SEGV. Let me know if you run into any more issues. -Nathan ___ devel mailing list de...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: http://www.open-mpi.org/community/lists/devel/2014/09/15915.php Index: ompi/mca/topo/base/topo_base_cart_create.c === --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) @@ -163,10 +163,18 @@ return MPI_ERR_INTERN; } +assert(NULL == new_comm->c_topo); +assert(!(new_comm->c_flags & OMPI_COMM_CART)); +new_comm->c_topo = topo; +new_comm->c_topo->mtc.cart = cart; +new_comm->c_topo->reorder = reorder; +new_comm->c_flags |= OMPI_COMM_CART; ret = ompi_comm_enable(old_comm, new_comm, new_rank, num_procs, topo_procs); if (OMPI_SUCCESS != r
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
oops, i missed graph and dist graph :-( the attached patch fixes this /* it was straightforward for graph and a bit more challenging for dist graph */ i am fine for Nathan's patch for v1.8 /* v1.8 and trunk have already a different way to handle a memory leak */ /* the attached patch is vs v1.8 but it is for review only, i will port it to the trunk */ imho, there is no ABI break with this patch. out of curiosity, what is the ABI break that is/will be introduced with the "real" fix ? FYI, i also enhanced the collective/neighbor_allgather_self from the ibm test suite in order to test graph and dist_graph Cheers, Gilles On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote: > On the call today, we decided that Nathan's v1.8 patch is sufficient for the > v1.8 series, and the "real" fix will be applied to the trunk and carried > forward to the v1.9 series (since it introduces an ABI break for the topo > framework, which we try not to do in the middle of a release series). > > On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: > >> Not quite right. There still is no topology information at collective >> selection time for either graph or dist graph. >> >> -Nathan >> >> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: >>> Nathan, >>> >>> here is a revision of the previously attached patch, and that supports >>> graph and dist graph. >>> >>> Cheers, >>> >>> Gilles >>> >>> On 2014/09/30 0:05, Nathan Hjelm wrote: >>> >>> An equivalent change would need to be made for graph and dist graph as >>> well. That will take a little more work. Also, I was avoiding changing >>> anything in topo for 1.8. >>> >>> -Nathan >>> >>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >>> >>>Nathan, >>> >>>why not just make the topology information available at that point as you >>>described it ? >>> >>>the attached patch does this, could you please review it ? >>> >>>Cheers, >>> >>>Gilles >>> >>>On 2014/09/26 2:50, Nathan Hjelm wrote: >>> >>> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >>> >>> I finally managed to track down some issues in mpi4py's test suite >>> using Open MPI 1.8+. The code below should be enough to reproduce the >>> problem. Run it under valgrind to make sense of my following >>> diagnostics. >>> >>> In this code I'm creating a 2D, periodic Cartesian topology out of >>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >>> links to itself. So we have size=1 but indegree=outdegree=4. However, >>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >>> being allocated to manage communication: >>> >>> if (OMPI_COMM_IS_INTER(comm)) { >>> size = ompi_comm_remote_size(comm); >>> } else { >>> size = ompi_comm_size(comm); >>> } >>> basic_module->mccb_num_reqs = size * 2; >>> basic_module->mccb_reqs = (ompi_request_t**) >>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >>> >>> I guess you have to also special-case for topologies and allocate >>> indegree+outdegree requests (not sure about this number, just >>> guessing). >>> >>> >>> I wish this was possible but the topology information is not available >>> at that point. We may be able to change that but I don't see the work >>> completing anytime soon. I committed an alternative fix as r32796 and >>> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer >>> produces a SEGV. Let me know if you run into any more issues. >>> >>> >>> -Nathan >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php >>> >>> Index: ompi/mca/topo/base/topo_base_cart_create.c >>> === >>> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) >>> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) >>> @@ -163,10 +163,18 @@ >>> return MPI_ERR_INTERN; >>> } >>> >>> +assert(NULL == new_comm->c_topo); >>> +assert(!(new_comm->c_flags & OMPI_COMM_CART)); >>> +new_comm->c_topo = topo; >>> +new_comm->c_topo->mtc.cart = cart; >>> +new_comm->c_topo->reorder = reorder; >>> +new_comm->c_flags |= OMPI_COMM_CART; >>> ret = ompi_comm_enable(old_comm, new_comm, >>> new_rank, num_procs, topo_procs); >>> if (OMPI_SUCCESS != ret) { >>> /* something wrong happened during setting the communicator */ >>> +new_comm->c_topo = NULL; >>> +new_comm->c_flags &= ~OMPI_COMM_CART; >>> ompi_comm_free (&new_comm); >>> free(topo_procs); >>> if(NULL != cart->periods) free(cart->periods); >>> @@ -176,10 +184,6 @@ >>> return ret; >>> } >>> >
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
On the call today, we decided that Nathan's v1.8 patch is sufficient for the v1.8 series, and the "real" fix will be applied to the trunk and carried forward to the v1.9 series (since it introduces an ABI break for the topo framework, which we try not to do in the middle of a release series). On Sep 30, 2014, at 10:40 AM, Nathan Hjelm wrote: > > Not quite right. There still is no topology information at collective > selection time for either graph or dist graph. > > -Nathan > > On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: >> Nathan, >> >> here is a revision of the previously attached patch, and that supports >> graph and dist graph. >> >> Cheers, >> >> Gilles >> >> On 2014/09/30 0:05, Nathan Hjelm wrote: >> >> An equivalent change would need to be made for graph and dist graph as >> well. That will take a little more work. Also, I was avoiding changing >> anything in topo for 1.8. >> >> -Nathan >> >> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >> >>Nathan, >> >>why not just make the topology information available at that point as you >>described it ? >> >>the attached patch does this, could you please review it ? >> >>Cheers, >> >>Gilles >> >>On 2014/09/26 2:50, Nathan Hjelm wrote: >> >> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> >> >> I wish this was possible but the topology information is not available >> at that point. We may be able to change that but I don't see the work >> completing anytime soon. I committed an alternative fix as r32796 and >> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer >> produces a SEGV. Let me know if you run into any more issues. >> >> >> -Nathan >> >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php >> >> Index: ompi/mca/topo/base/topo_base_cart_create.c >> === >> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) >> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) >> @@ -163,10 +163,18 @@ >> return MPI_ERR_INTERN; >> } >> >> +assert(NULL == new_comm->c_topo); >> +assert(!(new_comm->c_flags & OMPI_COMM_CART)); >> +new_comm->c_topo = topo; >> +new_comm->c_topo->mtc.cart = cart; >> +new_comm->c_topo->reorder = reorder; >> +new_comm->c_flags |= OMPI_COMM_CART; >> ret = ompi_comm_enable(old_comm, new_comm, >> new_rank, num_procs, topo_procs); >> if (OMPI_SUCCESS != ret) { >> /* something wrong happened during setting the communicator */ >> +new_comm->c_topo = NULL; >> +new_comm->c_flags &= ~OMPI_COMM_CART; >> ompi_comm_free (&new_comm); >> free(topo_procs); >> if(NULL != cart->periods) free(cart->periods); >> @@ -176,10 +184,6 @@ >> return ret; >> } >> >> -new_comm->c_topo = topo; >> -new_comm->c_topo->mtc.cart = cart; >> -new_comm->c_topo->reorder = reorder; >> -new_comm->c_flags |= OMPI_COMM_CART; >> *comm_topo = new_comm; >> >> if( MPI_UNDEFINED == new_rank ) { >> Index: ompi/mca/coll/basic/coll_basic_module.c >> === >> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) >> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) >> @@ -13,6 +13,8 @@ >> * Copyright (c) 2012 Sandia National Laboratories. All rights >> reserved. >> * Copyright (c) 2013 Los Alamos National Security, LLC. All rights >> * reserved. >> + * Copyright (c) 2014 R
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Not quite right. There still is no topology information at collective selection time for either graph or dist graph. -Nathan On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote: >Nathan, > >here is a revision of the previously attached patch, and that supports >graph and dist graph. > >Cheers, > >Gilles > >On 2014/09/30 0:05, Nathan Hjelm wrote: > > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: > > Nathan, > > why not just make the topology information available at that point as you > described it ? > > the attached patch does this, could you please review it ? > > Cheers, > > Gilles > > On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > > Index: ompi/mca/topo/base/topo_base_cart_create.c > === > --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) > +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) > @@ -163,10 +163,18 @@ > return MPI_ERR_INTERN; > } > > +assert(NULL == new_comm->c_topo); > +assert(!(new_comm->c_flags & OMPI_COMM_CART)); > +new_comm->c_topo = topo; > +new_comm->c_topo->mtc.cart = cart; > +new_comm->c_topo->reorder = reorder; > +new_comm->c_flags |= OMPI_COMM_CART; > ret = ompi_comm_enable(old_comm, new_comm, > new_rank, num_procs, topo_procs); > if (OMPI_SUCCESS != ret) { > /* something wrong happened during setting the communicator */ > +new_comm->c_topo = NULL; > +new_comm->c_flags &= ~OMPI_COMM_CART; > ompi_comm_free (&new_comm); > free(topo_procs); > if(NULL != cart->periods) free(cart->periods); > @@ -176,10 +184,6 @@ > return ret; > } > > -new_comm->c_topo = topo; > -new_comm->c_topo->mtc.cart = cart; > -new_comm->c_topo->reorder = reorder; > -new_comm->c_flags |= OMPI_COMM_CART; > *comm_topo = new_comm; > > if( MPI_UNDEFINED == new_rank ) { > Index: ompi/mca/coll/basic/coll_basic_module.c > === > --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) > +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) > @@ -13,6 +13,8 @@ >* Copyright (c) 2012 Sandia National Laboratories. All rights > reserved. >* Copyright (c) 2013 Los Alamos National Security, LLC. All rights >* reserved. > + * Copyright (c) 2014 Research Organization for Information Science > + * and Technology (RIST). All rights reserved. >* $COPYRIGHT$ >* >* Additional copyrights may follow > @@ -28,6 +30,7 @@ > #include "mpi.h" > #include "ompi/mca/coll/coll.h" > #include "ompi/mca/coll/base/base.h" > +#include "ompi/mca/topo/topo.h" > #include "coll_basic.h" > > > @@ -70,6 +
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Nathan, here is a revision of the previously attached patch, and that supports graph and dist graph. Cheers, Gilles On 2014/09/30 0:05, Nathan Hjelm wrote: > An equivalent change would need to be made for graph and dist graph as > well. That will take a little more work. Also, I was avoiding changing > anything in topo for 1.8. > > -Nathan > > On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >>Nathan, >> >>why not just make the topology information available at that point as you >>described it ? >> >>the attached patch does this, could you please review it ? >> >>Cheers, >> >>Gilles >> >>On 2014/09/26 2:50, Nathan Hjelm wrote: >> >> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> >> >> I wish this was possible but the topology information is not available >> at that point. We may be able to change that but I don't see the work >> completing anytime soon. I committed an alternative fix as r32796 and >> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer >> produces a SEGV. Let me know if you run into any more issues. >> >> >> -Nathan >> >> ___ >> devel mailing list >> de...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php >> Index: ompi/mca/topo/base/topo_base_cart_create.c >> === >> --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) >> +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) >> @@ -163,10 +163,18 @@ >> return MPI_ERR_INTERN; >> } >> >> +assert(NULL == new_comm->c_topo); >> +assert(!(new_comm->c_flags & OMPI_COMM_CART)); >> +new_comm->c_topo = topo; >> +new_comm->c_topo->mtc.cart = cart; >> +new_comm->c_topo->reorder = reorder; >> +new_comm->c_flags |= OMPI_COMM_CART; >> ret = ompi_comm_enable(old_comm, new_comm, >> new_rank, num_procs, topo_procs); >> if (OMPI_SUCCESS != ret) { >> /* something wrong happened during setting the communicator */ >> +new_comm->c_topo = NULL; >> +new_comm->c_flags &= ~OMPI_COMM_CART; >> ompi_comm_free (&new_comm); >> free(topo_procs); >> if(NULL != cart->periods) free(cart->periods); >> @@ -176,10 +184,6 @@ >> return ret; >> } >> >> -new_comm->c_topo = topo; >> -new_comm->c_topo->mtc.cart = cart; >> -new_comm->c_topo->reorder = reorder; >> -new_comm->c_flags |= OMPI_COMM_CART; >> *comm_topo = new_comm; >> >> if( MPI_UNDEFINED == new_rank ) { >> Index: ompi/mca/coll/basic/coll_basic_module.c >> === >> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) >> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) >> @@ -13,6 +13,8 @@ >> * Copyright (c) 2012 Sandia National Laboratories. All rights >> reserved. >> * Copyright (c) 2013 Los Alamos National Security, LLC. All rights >> * reserved. >> + * Copyright (c) 2014 Research Organization for Information Science >> + * and Technology (RIST). All rights reserved. >> * $COPYRIGHT$ >> * >> * Additional copyrights may follow >> @@ -28,6 +30,7 @@ >> #include "mpi.h" >> #include "ompi/mca/coll/coll.h" >> #include "ompi/mca/coll/base/base.h" >> +#include "ompi/mca/topo/topo.h" >> #include "coll_basic.h" >> >> >> @@ -70,6 +73,15 @@ >> } else { >> size = ompi_comm_size(comm); >> } >> +if (comm->c_flags & OMPI_COMM_CART) { >> +int cart_size; >> +assert (NULL != comm->c_topo); >> +comm->c_topo->topo.cart.cartdim_ge
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
An equivalent change would need to be made for graph and dist graph as well. That will take a little more work. Also, I was avoiding changing anything in topo for 1.8. -Nathan On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote: >Nathan, > >why not just make the topology information available at that point as you >described it ? > >the attached patch does this, could you please review it ? > >Cheers, > >Gilles > >On 2014/09/26 2:50, Nathan Hjelm wrote: > > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php > Index: ompi/mca/topo/base/topo_base_cart_create.c > === > --- ompi/mca/topo/base/topo_base_cart_create.c(revision 32807) > +++ ompi/mca/topo/base/topo_base_cart_create.c(working copy) > @@ -163,10 +163,18 @@ > return MPI_ERR_INTERN; > } > > +assert(NULL == new_comm->c_topo); > +assert(!(new_comm->c_flags & OMPI_COMM_CART)); > +new_comm->c_topo = topo; > +new_comm->c_topo->mtc.cart = cart; > +new_comm->c_topo->reorder = reorder; > +new_comm->c_flags |= OMPI_COMM_CART; > ret = ompi_comm_enable(old_comm, new_comm, > new_rank, num_procs, topo_procs); > if (OMPI_SUCCESS != ret) { > /* something wrong happened during setting the communicator */ > +new_comm->c_topo = NULL; > +new_comm->c_flags &= ~OMPI_COMM_CART; > ompi_comm_free (&new_comm); > free(topo_procs); > if(NULL != cart->periods) free(cart->periods); > @@ -176,10 +184,6 @@ > return ret; > } > > -new_comm->c_topo = topo; > -new_comm->c_topo->mtc.cart = cart; > -new_comm->c_topo->reorder = reorder; > -new_comm->c_flags |= OMPI_COMM_CART; > *comm_topo = new_comm; > > if( MPI_UNDEFINED == new_rank ) { > Index: ompi/mca/coll/basic/coll_basic_module.c > === > --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) > +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) > @@ -13,6 +13,8 @@ > * Copyright (c) 2012 Sandia National Laboratories. All rights reserved. > * Copyright (c) 2013 Los Alamos National Security, LLC. All rights > * reserved. > + * Copyright (c) 2014 Research Organization for Information Science > + * and Technology (RIST). All rights reserved. > * $COPYRIGHT$ > * > * Additional copyrights may follow > @@ -28,6 +30,7 @@ > #include "mpi.h" > #include "ompi/mca/coll/coll.h" > #include "ompi/mca/coll/base/base.h" > +#include "ompi/mca/topo/topo.h" > #include "coll_basic.h" > > > @@ -70,6 +73,15 @@ > } else { > size = ompi_comm_size(comm); > } > +if (comm->c_flags & OMPI_COMM_CART) { > +int cart_size; > +assert (NULL != comm->c_topo); > +comm->c_topo->topo.cart.cartdim_get(comm, &cart_size); > +cart_size *= 2; > +if (cart_size > size) { > +size = cart_size; > +} > +} > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_reques
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Nathan, why not just make the topology information available at that point as you described it ? the attached patch does this, could you please review it ? Cheers, Gilles On 2014/09/26 2:50, Nathan Hjelm wrote: > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > > > -Nathan > > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15915.php Index: ompi/mca/topo/base/topo_base_cart_create.c === --- ompi/mca/topo/base/topo_base_cart_create.c (revision 32807) +++ ompi/mca/topo/base/topo_base_cart_create.c (working copy) @@ -163,10 +163,18 @@ return MPI_ERR_INTERN; } +assert(NULL == new_comm->c_topo); +assert(!(new_comm->c_flags & OMPI_COMM_CART)); +new_comm->c_topo = topo; +new_comm->c_topo->mtc.cart = cart; +new_comm->c_topo->reorder = reorder; +new_comm->c_flags |= OMPI_COMM_CART; ret = ompi_comm_enable(old_comm, new_comm, new_rank, num_procs, topo_procs); if (OMPI_SUCCESS != ret) { /* something wrong happened during setting the communicator */ +new_comm->c_topo = NULL; +new_comm->c_flags &= ~OMPI_COMM_CART; ompi_comm_free (&new_comm); free(topo_procs); if(NULL != cart->periods) free(cart->periods); @@ -176,10 +184,6 @@ return ret; } -new_comm->c_topo = topo; -new_comm->c_topo->mtc.cart = cart; -new_comm->c_topo->reorder = reorder; -new_comm->c_flags |= OMPI_COMM_CART; *comm_topo = new_comm; if( MPI_UNDEFINED == new_rank ) { Index: ompi/mca/coll/basic/coll_basic_module.c === --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807) +++ ompi/mca/coll/basic/coll_basic_module.c (working copy) @@ -13,6 +13,8 @@ * Copyright (c) 2012 Sandia National Laboratories. All rights reserved. * Copyright (c) 2013 Los Alamos National Security, LLC. All rights * reserved. + * Copyright (c) 2014 Research Organization for Information Science + * and Technology (RIST). All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -28,6 +30,7 @@ #include "mpi.h" #include "ompi/mca/coll/coll.h" #include "ompi/mca/coll/base/base.h" +#include "ompi/mca/topo/topo.h" #include "coll_basic.h" @@ -70,6 +73,15 @@ } else { size = ompi_comm_size(comm); } +if (comm->c_flags & OMPI_COMM_CART) { +int cart_size; +assert (NULL != comm->c_topo); +comm->c_topo->topo.cart.cartdim_get(comm, &cart_size); +cart_size *= 2; +if (cart_size > size) { +size = cart_size; +} +} basic_module->mccb_num_reqs = size * 2; basic_module->mccb_reqs = (ompi_request_t**) malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
On 25 September 2014 20:50, Nathan Hjelm wrote: > On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: >> I finally managed to track down some issues in mpi4py's test suite >> using Open MPI 1.8+. The code below should be enough to reproduce the >> problem. Run it under valgrind to make sense of my following >> diagnostics. >> >> In this code I'm creating a 2D, periodic Cartesian topology out of >> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out >> links to itself. So we have size=1 but indegree=outdegree=4. However, >> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are >> being allocated to manage communication: >> >> if (OMPI_COMM_IS_INTER(comm)) { >> size = ompi_comm_remote_size(comm); >> } else { >> size = ompi_comm_size(comm); >> } >> basic_module->mccb_num_reqs = size * 2; >> basic_module->mccb_reqs = (ompi_request_t**) >> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); >> >> I guess you have to also special-case for topologies and allocate >> indegree+outdegree requests (not sure about this number, just >> guessing). >> > > I wish this was possible but the topology information is not available > at that point. We may be able to change that but I don't see the work > completing anytime soon. I committed an alternative fix as r32796 and > CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer > produces a SEGV. Let me know if you run into any more issues. > Did your fix get in for 1.8.3? I'm still getting the segfault. -- Lisandro Dalcin Research Scientist Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Numerical Porous Media Center (NumPor) King Abdullah University of Science and Technology (KAUST) http://numpor.kaust.edu.sa/ 4700 King Abdullah University of Science and Technology al-Khawarizmi Bldg (Bldg 1), Office # 4332 Thuwal 23955-6900, Kingdom of Saudi Arabia http://www.kaust.edu.sa Office Phone: +966 12 808-0459
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > I wish this was possible but the topology information is not available at that point. We may be able to change that but I don't see the work completing anytime soon. I committed an alternative fix as r32796 and CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer produces a SEGV. Let me know if you run into any more issues. -Nathan pgpiboDbxhbSj.pgp Description: PGP signature
Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
Good catch. I will take a look and see how best to fix this. -Nathan On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote: > I finally managed to track down some issues in mpi4py's test suite > using Open MPI 1.8+. The code below should be enough to reproduce the > problem. Run it under valgrind to make sense of my following > diagnostics. > > In this code I'm creating a 2D, periodic Cartesian topology out of > COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out > links to itself. So we have size=1 but indegree=outdegree=4. However, > in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are > being allocated to manage communication: > > if (OMPI_COMM_IS_INTER(comm)) { > size = ompi_comm_remote_size(comm); > } else { > size = ompi_comm_size(comm); > } > basic_module->mccb_num_reqs = size * 2; > basic_module->mccb_reqs = (ompi_request_t**) > malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); > > I guess you have to also special-case for topologies and allocate > indegree+outdegree requests (not sure about this number, just > guessing). > > > #include > #include > > int main(int argc, char *argv[]) > { > MPI_Comm comm; > int ndims = 2, dims[2] = {1,1}, periods[2] = {1,1}; > int sendbuf = 7, recvbuf[5] = {0,0,0,0,0}; > MPI_Init(&argc, &argv); > > MPI_Cart_create(MPI_COMM_SELF, ndims, dims, periods, 0, &comm); > > MPI_Neighbor_allgather(&sendbuf, 1, MPI_INT, > recvbuf, 1, MPI_INT, > comm); > > {int i; for (i=0;i<5;i++) printf("%d ",recvbuf[i]); printf("\n");} > > MPI_Finalize(); > return 0; > } > > > -- > Lisandro Dalcin > > Research Scientist > Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) > Numerical Porous Media Center (NumPor) > King Abdullah University of Science and Technology (KAUST) > http://numpor.kaust.edu.sa/ > > 4700 King Abdullah University of Science and Technology > al-Khawarizmi Bldg (Bldg 1), Office # 4332 > Thuwal 23955-6900, Kingdom of Saudi Arabia > http://www.kaust.edu.sa > > Office Phone: +966 12 808-0459 > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/08/15713.php pgp4ELRAwv940.pgp Description: PGP signature
[OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one
I finally managed to track down some issues in mpi4py's test suite using Open MPI 1.8+. The code below should be enough to reproduce the problem. Run it under valgrind to make sense of my following diagnostics. In this code I'm creating a 2D, periodic Cartesian topology out of COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out links to itself. So we have size=1 but indegree=outdegree=4. However, in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are being allocated to manage communication: if (OMPI_COMM_IS_INTER(comm)) { size = ompi_comm_remote_size(comm); } else { size = ompi_comm_size(comm); } basic_module->mccb_num_reqs = size * 2; basic_module->mccb_reqs = (ompi_request_t**) malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs); I guess you have to also special-case for topologies and allocate indegree+outdegree requests (not sure about this number, just guessing). #include #include int main(int argc, char *argv[]) { MPI_Comm comm; int ndims = 2, dims[2] = {1,1}, periods[2] = {1,1}; int sendbuf = 7, recvbuf[5] = {0,0,0,0,0}; MPI_Init(&argc, &argv); MPI_Cart_create(MPI_COMM_SELF, ndims, dims, periods, 0, &comm); MPI_Neighbor_allgather(&sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, comm); {int i; for (i=0;i<5;i++) printf("%d ",recvbuf[i]); printf("\n");} MPI_Finalize(); return 0; } -- Lisandro Dalcin Research Scientist Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Numerical Porous Media Center (NumPor) King Abdullah University of Science and Technology (KAUST) http://numpor.kaust.edu.sa/ 4700 King Abdullah University of Science and Technology al-Khawarizmi Bldg (Bldg 1), Office # 4332 Thuwal 23955-6900, Kingdom of Saudi Arabia http://www.kaust.edu.sa Office Phone: +966 12 808-0459