Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread Nathan Hjelm
I should have been clearer. This changes how the topo components are
expected to work. With this change they now MUST provide the c_topo to
the communicator before coll_select. The change is good but may be
unexpected for any vendor topo components out there (which I doubt
there are any).

-Nathan

On Wed, Oct 01, 2014 at 12:52:00PM +, Jeff Squyres (jsquyres) wrote:
> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet 
>  wrote:
> 
> > oops, i missed graph and dist graph :-(
> > 
> > the attached patch fixes this
> > /* it was straightforward for graph and a bit more challenging for dist
> > graph */
> > 
> > i am fine for Nathan's patch for v1.8
> > /* v1.8 and trunk have already a different way to handle a memory leak */
> > /* the attached patch is vs v1.8 but it is for review only, i will port
> > it to the trunk */
> > 
> > imho, there is no ABI break with this patch.
> > out of curiosity, what is the ABI break that is/will be introduced with
> > the "real" fix ?
> 
> I haven't looked at the code at all, but Nathan said it changed the topo 
> framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
> Sorry for the confusions.
> 
> > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
> > test suite in order to test
> > graph and dist_graph
> > 
> > Cheers,
> > 
> > Gilles
> > 
> > 
> > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
> >> On the call today, we decided that Nathan's v1.8 patch is sufficient for 
> >> the v1.8 series, and the "real" fix will be applied to the trunk and 
> >> carried forward to the v1.9 series (since it introduces an ABI break for 
> >> the topo framework, which we try not to do in the middle of a release 
> >> series).
> >> 
> >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
> >> 
> >>> Not quite right. There still is no topology information at collective
> >>> selection time for either graph or dist graph.
> >>> 
> >>> -Nathan
> >>> 
> >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>   Nathan,
>  
>   here is a revision of the previously attached patch, and that supports
>   graph and dist graph.
>  
>   Cheers,
>  
>   Gilles
>  
>   On 2014/09/30 0:05, Nathan Hjelm wrote:
>  
>  An equivalent change would need to be made for graph and dist graph as
>  well. That will take a little more work. Also, I was avoiding changing
>  anything in topo for 1.8.
>  
>  -Nathan
>  
>  On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>  
>    Nathan,
>  
>    why not just make the topology information available at that point as 
>  you
>    described it ?
>  
>    the attached patch does this, could you please review it ?
>  
>    Cheers,
>  
>    Gilles
>  
>    On 2014/09/26 2:50, Nathan Hjelm wrote:
>  
>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>  
>  I finally managed to track down some issues in mpi4py's test suite
>  using Open MPI 1.8+. The code below should be enough to reproduce the
>  problem. Run it under valgrind to make sense of my following
>  diagnostics.
>  
>  In this code I'm creating a 2D, periodic Cartesian topology out of
>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>  being allocated to manage communication:
>  
>  if (OMPI_COMM_IS_INTER(comm)) {
>  size = ompi_comm_remote_size(comm);
>  } else {
>  size = ompi_comm_size(comm);
>  }
>  basic_module->mccb_num_reqs = size * 2;
>  basic_module->mccb_reqs = (ompi_request_t**)
>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>  
>  I guess you have to also special-case for topologies and allocate
>  indegree+outdegree requests (not sure about this number, just
>  guessing).
>  
>  
>  I wish this was possible but the topology information is not available
>  at that point. We may be able to change that but I don't see the work
>  completing anytime soon. I committed an alternative fix as r32796 and
>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>  produces a SEGV. Let me know if you run into any more issues.
>  
>  
>  -Nathan
>  
>  ___
>  devel mailing list
>  de...@open-mpi.org
>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  Link to this post: 
>  http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>  
>  Index: ompi/mca/topo/base/topo_base_cart_create.c
>  

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread George Bosilca
I see no change in the topo interface in any of the patches attached to this 
thread. Is there any other patch related to this discussion?

  George.



> On Oct 1, 2014, at 14:52, Jeff Squyres (jsquyres)  wrote:
> 
>> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet 
>>  wrote:
>> 
>> oops, i missed graph and dist graph :-(
>> 
>> the attached patch fixes this
>> /* it was straightforward for graph and a bit more challenging for dist
>> graph */
>> 
>> i am fine for Nathan's patch for v1.8
>> /* v1.8 and trunk have already a different way to handle a memory leak */
>> /* the attached patch is vs v1.8 but it is for review only, i will port
>> it to the trunk */
>> 
>> imho, there is no ABI break with this patch.
>> out of curiosity, what is the ABI break that is/will be introduced with
>> the "real" fix ?
> 
> I haven't looked at the code at all, but Nathan said it changed the topo 
> framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
> Sorry for the confusions.
> 
>> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
>> test suite in order to test
>> graph and dist_graph
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>>> On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
>>> On the call today, we decided that Nathan's v1.8 patch is sufficient for 
>>> the v1.8 series, and the "real" fix will be applied to the trunk and 
>>> carried forward to the v1.9 series (since it introduces an ABI break for 
>>> the topo framework, which we try not to do in the middle of a release 
>>> series).
>>> 
>>> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
>>> 
 Not quite right. There still is no topology information at collective
 selection time for either graph or dist graph.
 
 -Nathan
 
> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
> Nathan,
> 
> here is a revision of the previously attached patch, and that supports
> graph and dist graph.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/09/30 0:05, Nathan Hjelm wrote:
> 
> An equivalent change would need to be made for graph and dist graph as
> well. That will take a little more work. Also, I was avoiding changing
> anything in topo for 1.8.
> 
> -Nathan
> 
> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
> 
>  Nathan,
> 
>  why not just make the topology information available at that point as you
>  described it ?
> 
>  the attached patch does this, could you please review it ?
> 
>  Cheers,
> 
>  Gilles
> 
>  On 2014/09/26 2:50, Nathan Hjelm wrote:
> 
> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> 
> I finally managed to track down some issues in mpi4py's test suite
> using Open MPI 1.8+. The code below should be enough to reproduce the
> problem. Run it under valgrind to make sense of my following
> diagnostics.
> 
> In this code I'm creating a 2D, periodic Cartesian topology out of
> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
> links to itself. So we have size=1 but indegree=outdegree=4. However,
> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
> being allocated to manage communication:
> 
>if (OMPI_COMM_IS_INTER(comm)) {
>size = ompi_comm_remote_size(comm);
>} else {
>size = ompi_comm_size(comm);
>}
>basic_module->mccb_num_reqs = size * 2;
>basic_module->mccb_reqs = (ompi_request_t**)
>malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
> I guess you have to also special-case for topologies and allocate
> indegree+outdegree requests (not sure about this number, just
> guessing).
> 
> 
> I wish this was possible but the topology information is not available
> at that point. We may be able to change that but I don't see the work
> completing anytime soon. I committed an alternative fix as r32796 and
> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
> produces a SEGV. Let me know if you run into any more issues.
> 
> 
> -Nathan
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
> 
> Index: ompi/mca/topo/base/topo_base_cart_create.c
> ===
> --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
> +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
> @@ -163,10 +163,18 @@
>return MPI_ERR_INTERN;
>}
> 
> +assert(NULL == 

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread Jeff Squyres (jsquyres)
On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet  
wrote:

> oops, i missed graph and dist graph :-(
> 
> the attached patch fixes this
> /* it was straightforward for graph and a bit more challenging for dist
> graph */
> 
> i am fine for Nathan's patch for v1.8
> /* v1.8 and trunk have already a different way to handle a memory leak */
> /* the attached patch is vs v1.8 but it is for review only, i will port
> it to the trunk */
> 
> imho, there is no ABI break with this patch.
> out of curiosity, what is the ABI break that is/will be introduced with
> the "real" fix ?

I haven't looked at the code at all, but Nathan said it changed the topo 
framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
Sorry for the confusions.

> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
> test suite in order to test
> graph and dist_graph
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
>> On the call today, we decided that Nathan's v1.8 patch is sufficient for the 
>> v1.8 series, and the "real" fix will be applied to the trunk and carried 
>> forward to the v1.9 series (since it introduces an ABI break for the topo 
>> framework, which we try not to do in the middle of a release series).
>> 
>> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
>> 
>>> Not quite right. There still is no topology information at collective
>>> selection time for either graph or dist graph.
>>> 
>>> -Nathan
>>> 
>>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
  Nathan,
 
  here is a revision of the previously attached patch, and that supports
  graph and dist graph.
 
  Cheers,
 
  Gilles
 
  On 2014/09/30 0:05, Nathan Hjelm wrote:
 
 An equivalent change would need to be made for graph and dist graph as
 well. That will take a little more work. Also, I was avoiding changing
 anything in topo for 1.8.
 
 -Nathan
 
 On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
 
   Nathan,
 
   why not just make the topology information available at that point as you
   described it ?
 
   the attached patch does this, could you please review it ?
 
   Cheers,
 
   Gilles
 
   On 2014/09/26 2:50, Nathan Hjelm wrote:
 
 On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
 
 I finally managed to track down some issues in mpi4py's test suite
 using Open MPI 1.8+. The code below should be enough to reproduce the
 problem. Run it under valgrind to make sense of my following
 diagnostics.
 
 In this code I'm creating a 2D, periodic Cartesian topology out of
 COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
 links to itself. So we have size=1 but indegree=outdegree=4. However,
 in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
 being allocated to manage communication:
 
 if (OMPI_COMM_IS_INTER(comm)) {
 size = ompi_comm_remote_size(comm);
 } else {
 size = ompi_comm_size(comm);
 }
 basic_module->mccb_num_reqs = size * 2;
 basic_module->mccb_reqs = (ompi_request_t**)
 malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
 
 I guess you have to also special-case for topologies and allocate
 indegree+outdegree requests (not sure about this number, just
 guessing).
 
 
 I wish this was possible but the topology information is not available
 at that point. We may be able to change that but I don't see the work
 completing anytime soon. I committed an alternative fix as r32796 and
 CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
 produces a SEGV. Let me know if you run into any more issues.
 
 
 -Nathan
 
 ___
 devel mailing list
 de...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
 Link to this post: 
 http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
 
 Index: ompi/mca/topo/base/topo_base_cart_create.c
 ===
 --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
 +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
 @@ -163,10 +163,18 @@
 return MPI_ERR_INTERN;
 }
 
 +assert(NULL == new_comm->c_topo);
 +assert(!(new_comm->c_flags & OMPI_COMM_CART));
 +new_comm->c_topo   = topo;
 +new_comm->c_topo->mtc.cart = cart;
 +new_comm->c_topo->reorder  = reorder;
 +new_comm->c_flags |= OMPI_COMM_CART;
 ret = ompi_comm_enable(old_comm, new_comm,
new_rank, 

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread Gilles Gouaillardet
oops, i missed graph and dist graph :-(

the attached patch fixes this
/* it was straightforward for graph and a bit more challenging for dist
graph */

i am fine for Nathan's patch for v1.8
/* v1.8 and trunk have already a different way to handle a memory leak */
/* the attached patch is vs v1.8 but it is for review only, i will port
it to the trunk */

imho, there is no ABI break with this patch.
out of curiosity, what is the ABI break that is/will be introduced with
the "real" fix ?

FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
test suite in order to test
graph and dist_graph

Cheers,

Gilles


On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
> On the call today, we decided that Nathan's v1.8 patch is sufficient for the 
> v1.8 series, and the "real" fix will be applied to the trunk and carried 
> forward to the v1.9 series (since it introduces an ABI break for the topo 
> framework, which we try not to do in the middle of a release series).
>
> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
>
>> Not quite right. There still is no topology information at collective
>> selection time for either graph or dist graph.
>>
>> -Nathan
>>
>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>>>   Nathan,
>>>
>>>   here is a revision of the previously attached patch, and that supports
>>>   graph and dist graph.
>>>
>>>   Cheers,
>>>
>>>   Gilles
>>>
>>>   On 2014/09/30 0:05, Nathan Hjelm wrote:
>>>
>>> An equivalent change would need to be made for graph and dist graph as
>>> well. That will take a little more work. Also, I was avoiding changing
>>> anything in topo for 1.8.
>>>
>>> -Nathan
>>>
>>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>>>
>>>Nathan,
>>>
>>>why not just make the topology information available at that point as you
>>>described it ?
>>>
>>>the attached patch does this, could you please review it ?
>>>
>>>Cheers,
>>>
>>>Gilles
>>>
>>>On 2014/09/26 2:50, Nathan Hjelm wrote:
>>>
>>>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>>>
>>>  I finally managed to track down some issues in mpi4py's test suite
>>>  using Open MPI 1.8+. The code below should be enough to reproduce the
>>>  problem. Run it under valgrind to make sense of my following
>>>  diagnostics.
>>>
>>>  In this code I'm creating a 2D, periodic Cartesian topology out of
>>>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>>>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>>>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>>>  being allocated to manage communication:
>>>
>>>  if (OMPI_COMM_IS_INTER(comm)) {
>>>  size = ompi_comm_remote_size(comm);
>>>  } else {
>>>  size = ompi_comm_size(comm);
>>>  }
>>>  basic_module->mccb_num_reqs = size * 2;
>>>  basic_module->mccb_reqs = (ompi_request_t**)
>>>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>>
>>>  I guess you have to also special-case for topologies and allocate
>>>  indegree+outdegree requests (not sure about this number, just
>>>  guessing).
>>>
>>>
>>>  I wish this was possible but the topology information is not available
>>>  at that point. We may be able to change that but I don't see the work
>>>  completing anytime soon. I committed an alternative fix as r32796 and
>>>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>>>  produces a SEGV. Let me know if you run into any more issues.
>>>
>>>
>>>  -Nathan
>>>
>>>  ___
>>>  devel mailing list
>>>  de...@open-mpi.org
>>>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>  Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>>>
>>> Index: ompi/mca/topo/base/topo_base_cart_create.c
>>> ===
>>> --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
>>> +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
>>> @@ -163,10 +163,18 @@
>>>  return MPI_ERR_INTERN;
>>>  }
>>>
>>> +assert(NULL == new_comm->c_topo);
>>> +assert(!(new_comm->c_flags & OMPI_COMM_CART));
>>> +new_comm->c_topo   = topo;
>>> +new_comm->c_topo->mtc.cart = cart;
>>> +new_comm->c_topo->reorder  = reorder;
>>> +new_comm->c_flags |= OMPI_COMM_CART;
>>>  ret = ompi_comm_enable(old_comm, new_comm,
>>> new_rank, num_procs, topo_procs);
>>>  if (OMPI_SUCCESS != ret) {
>>>  /* something wrong happened during setting the communicator */
>>> +new_comm->c_topo = NULL;
>>> +new_comm->c_flags &= ~OMPI_COMM_CART;
>>>  ompi_comm_free (_comm);
>>>  free(topo_procs);
>>>  if(NULL != cart->periods) free(cart->periods);
>>> @@ -176,10 +184,6 @@
>>>  return ret;

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-30 Thread Nathan Hjelm

Not quite right. There still is no topology information at collective
selection time for either graph or dist graph.

-Nathan

On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>Nathan,
> 
>here is a revision of the previously attached patch, and that supports
>graph and dist graph.
> 
>Cheers,
> 
>Gilles
> 
>On 2014/09/30 0:05, Nathan Hjelm wrote:
> 
>  An equivalent change would need to be made for graph and dist graph as
>  well. That will take a little more work. Also, I was avoiding changing
>  anything in topo for 1.8.
> 
>  -Nathan
> 
>  On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
> 
> Nathan,
> 
> why not just make the topology information available at that point as you
> described it ?
> 
> the attached patch does this, could you please review it ?
> 
> Cheers,
> 
> Gilles
> 
> On 2014/09/26 2:50, Nathan Hjelm wrote:
> 
>   On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> 
>   I finally managed to track down some issues in mpi4py's test suite
>   using Open MPI 1.8+. The code below should be enough to reproduce the
>   problem. Run it under valgrind to make sense of my following
>   diagnostics.
> 
>   In this code I'm creating a 2D, periodic Cartesian topology out of
>   COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>   links to itself. So we have size=1 but indegree=outdegree=4. However,
>   in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>   being allocated to manage communication:
> 
>   if (OMPI_COMM_IS_INTER(comm)) {
>   size = ompi_comm_remote_size(comm);
>   } else {
>   size = ompi_comm_size(comm);
>   }
>   basic_module->mccb_num_reqs = size * 2;
>   basic_module->mccb_reqs = (ompi_request_t**)
>   malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
>   I guess you have to also special-case for topologies and allocate
>   indegree+outdegree requests (not sure about this number, just
>   guessing).
> 
> 
>   I wish this was possible but the topology information is not available
>   at that point. We may be able to change that but I don't see the work
>   completing anytime soon. I committed an alternative fix as r32796 and
>   CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>   produces a SEGV. Let me know if you run into any more issues.
> 
> 
>   -Nathan
> 
>   ___
>   devel mailing list
>   de...@open-mpi.org
>   Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>   Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
> 
>  Index: ompi/mca/topo/base/topo_base_cart_create.c
>  ===
>  --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
>  +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
>  @@ -163,10 +163,18 @@
>   return MPI_ERR_INTERN;
>   }
>  
>  +assert(NULL == new_comm->c_topo);
>  +assert(!(new_comm->c_flags & OMPI_COMM_CART));
>  +new_comm->c_topo   = topo;
>  +new_comm->c_topo->mtc.cart = cart;
>  +new_comm->c_topo->reorder  = reorder;
>  +new_comm->c_flags |= OMPI_COMM_CART;
>   ret = ompi_comm_enable(old_comm, new_comm,
>  new_rank, num_procs, topo_procs);
>   if (OMPI_SUCCESS != ret) {
>   /* something wrong happened during setting the communicator */
>  +new_comm->c_topo = NULL;
>  +new_comm->c_flags &= ~OMPI_COMM_CART;
>   ompi_comm_free (_comm);
>   free(topo_procs);
>   if(NULL != cart->periods) free(cart->periods);
>  @@ -176,10 +184,6 @@
>   return ret;
>   }
>  
>  -new_comm->c_topo   = topo;
>  -new_comm->c_topo->mtc.cart = cart;
>  -new_comm->c_topo->reorder  = reorder;
>  -new_comm->c_flags |= OMPI_COMM_CART;
>   *comm_topo = new_comm;
>  
>   if( MPI_UNDEFINED == new_rank ) {
>  Index: ompi/mca/coll/basic/coll_basic_module.c
>  ===
>  --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807)
>  +++ ompi/mca/coll/basic/coll_basic_module.c (working copy)
>  @@ -13,6 +13,8 @@
>* Copyright (c) 2012  Sandia National Laboratories. All rights 
> reserved.
>* Copyright (c) 2013  Los Alamos National Security, LLC. All rights
>* reserved.
>  + * Copyright (c) 2014  Research Organization for Information Science
>  + * and Technology (RIST). All rights reserved.
>* $COPYRIGHT$
>*
>* Additional copyrights may follow
>  @@ -28,6 +30,7 @@
>   #include "mpi.h"
>   #include "ompi/mca/coll/coll.h"
>   #include "ompi/mca/coll/base/base.h"
>  +#include "ompi/mca/topo/topo.h"
>   #include "coll_basic.h"
>  
>  
>  @@ -70,6 

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-30 Thread Gilles Gouaillardet
Nathan,

here is a revision of the previously attached patch, and that supports
graph and dist graph.

Cheers,

Gilles

On 2014/09/30 0:05, Nathan Hjelm wrote:
> An equivalent change would need to be made for graph and dist graph as
> well. That will take a little more work. Also, I was avoiding changing
> anything in topo for 1.8.
>
> -Nathan
>
> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>>Nathan,
>>
>>why not just make the topology information available at that point as you
>>described it ?
>>
>>the attached patch does this, could you please review it ?
>>
>>Cheers,
>>
>>Gilles
>>
>>On 2014/09/26 2:50, Nathan Hjelm wrote:
>>
>>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>>
>>  I finally managed to track down some issues in mpi4py's test suite
>>  using Open MPI 1.8+. The code below should be enough to reproduce the
>>  problem. Run it under valgrind to make sense of my following
>>  diagnostics.
>>
>>  In this code I'm creating a 2D, periodic Cartesian topology out of
>>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>>  being allocated to manage communication:
>>
>>  if (OMPI_COMM_IS_INTER(comm)) {
>>  size = ompi_comm_remote_size(comm);
>>  } else {
>>  size = ompi_comm_size(comm);
>>  }
>>  basic_module->mccb_num_reqs = size * 2;
>>  basic_module->mccb_reqs = (ompi_request_t**)
>>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>
>>  I guess you have to also special-case for topologies and allocate
>>  indegree+outdegree requests (not sure about this number, just
>>  guessing).
>>
>>
>>  I wish this was possible but the topology information is not available
>>  at that point. We may be able to change that but I don't see the work
>>  completing anytime soon. I committed an alternative fix as r32796 and
>>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>>  produces a SEGV. Let me know if you run into any more issues.
>>
>>
>>  -Nathan
>>
>>  ___
>>  devel mailing list
>>  de...@open-mpi.org
>>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>  Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>> Index: ompi/mca/topo/base/topo_base_cart_create.c
>> ===
>> --- ompi/mca/topo/base/topo_base_cart_create.c   (revision 32807)
>> +++ ompi/mca/topo/base/topo_base_cart_create.c   (working copy)
>> @@ -163,10 +163,18 @@
>>  return MPI_ERR_INTERN;
>>  }
>>  
>> +assert(NULL == new_comm->c_topo);
>> +assert(!(new_comm->c_flags & OMPI_COMM_CART));
>> +new_comm->c_topo   = topo;
>> +new_comm->c_topo->mtc.cart = cart;
>> +new_comm->c_topo->reorder  = reorder;
>> +new_comm->c_flags |= OMPI_COMM_CART;
>>  ret = ompi_comm_enable(old_comm, new_comm,
>> new_rank, num_procs, topo_procs);
>>  if (OMPI_SUCCESS != ret) {
>>  /* something wrong happened during setting the communicator */
>> +new_comm->c_topo = NULL;
>> +new_comm->c_flags &= ~OMPI_COMM_CART;
>>  ompi_comm_free (_comm);
>>  free(topo_procs);
>>  if(NULL != cart->periods) free(cart->periods);
>> @@ -176,10 +184,6 @@
>>  return ret;
>>  }
>>  
>> -new_comm->c_topo   = topo;
>> -new_comm->c_topo->mtc.cart = cart;
>> -new_comm->c_topo->reorder  = reorder;
>> -new_comm->c_flags |= OMPI_COMM_CART;
>>  *comm_topo = new_comm;
>>  
>>  if( MPI_UNDEFINED == new_rank ) {
>> Index: ompi/mca/coll/basic/coll_basic_module.c
>> ===
>> --- ompi/mca/coll/basic/coll_basic_module.c  (revision 32807)
>> +++ ompi/mca/coll/basic/coll_basic_module.c  (working copy)
>> @@ -13,6 +13,8 @@
>>   * Copyright (c) 2012  Sandia National Laboratories. All rights 
>> reserved.
>>   * Copyright (c) 2013  Los Alamos National Security, LLC. All rights
>>   * reserved.
>> + * Copyright (c) 2014  Research Organization for Information Science
>> + * and Technology (RIST). All rights reserved.
>>   * $COPYRIGHT$
>>   * 
>>   * Additional copyrights may follow
>> @@ -28,6 +30,7 @@
>>  #include "mpi.h"
>>  #include "ompi/mca/coll/coll.h"
>>  #include "ompi/mca/coll/base/base.h"
>> +#include "ompi/mca/topo/topo.h"
>>  #include "coll_basic.h"
>>  
>>  
>> @@ -70,6 +73,15 @@
>>  } else {
>>  size = ompi_comm_size(comm);
>>  }
>> +if (comm->c_flags & OMPI_COMM_CART) {
>> +int cart_size;
>> +assert (NULL != comm->c_topo);
>> +

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-29 Thread Nathan Hjelm
An equivalent change would need to be made for graph and dist graph as
well. That will take a little more work. Also, I was avoiding changing
anything in topo for 1.8.

-Nathan

On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>Nathan,
> 
>why not just make the topology information available at that point as you
>described it ?
> 
>the attached patch does this, could you please review it ?
> 
>Cheers,
> 
>Gilles
> 
>On 2014/09/26 2:50, Nathan Hjelm wrote:
> 
>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> 
>  I finally managed to track down some issues in mpi4py's test suite
>  using Open MPI 1.8+. The code below should be enough to reproduce the
>  problem. Run it under valgrind to make sense of my following
>  diagnostics.
> 
>  In this code I'm creating a 2D, periodic Cartesian topology out of
>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>  being allocated to manage communication:
> 
>  if (OMPI_COMM_IS_INTER(comm)) {
>  size = ompi_comm_remote_size(comm);
>  } else {
>  size = ompi_comm_size(comm);
>  }
>  basic_module->mccb_num_reqs = size * 2;
>  basic_module->mccb_reqs = (ompi_request_t**)
>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
>  I guess you have to also special-case for topologies and allocate
>  indegree+outdegree requests (not sure about this number, just
>  guessing).
> 
> 
>  I wish this was possible but the topology information is not available
>  at that point. We may be able to change that but I don't see the work
>  completing anytime soon. I committed an alternative fix as r32796 and
>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>  produces a SEGV. Let me know if you run into any more issues.
> 
> 
>  -Nathan
> 
>  ___
>  devel mailing list
>  de...@open-mpi.org
>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php

> Index: ompi/mca/topo/base/topo_base_cart_create.c
> ===
> --- ompi/mca/topo/base/topo_base_cart_create.c(revision 32807)
> +++ ompi/mca/topo/base/topo_base_cart_create.c(working copy)
> @@ -163,10 +163,18 @@
>  return MPI_ERR_INTERN;
>  }
>  
> +assert(NULL == new_comm->c_topo);
> +assert(!(new_comm->c_flags & OMPI_COMM_CART));
> +new_comm->c_topo   = topo;
> +new_comm->c_topo->mtc.cart = cart;
> +new_comm->c_topo->reorder  = reorder;
> +new_comm->c_flags |= OMPI_COMM_CART;
>  ret = ompi_comm_enable(old_comm, new_comm,
> new_rank, num_procs, topo_procs);
>  if (OMPI_SUCCESS != ret) {
>  /* something wrong happened during setting the communicator */
> +new_comm->c_topo = NULL;
> +new_comm->c_flags &= ~OMPI_COMM_CART;
>  ompi_comm_free (_comm);
>  free(topo_procs);
>  if(NULL != cart->periods) free(cart->periods);
> @@ -176,10 +184,6 @@
>  return ret;
>  }
>  
> -new_comm->c_topo   = topo;
> -new_comm->c_topo->mtc.cart = cart;
> -new_comm->c_topo->reorder  = reorder;
> -new_comm->c_flags |= OMPI_COMM_CART;
>  *comm_topo = new_comm;
>  
>  if( MPI_UNDEFINED == new_rank ) {
> Index: ompi/mca/coll/basic/coll_basic_module.c
> ===
> --- ompi/mca/coll/basic/coll_basic_module.c   (revision 32807)
> +++ ompi/mca/coll/basic/coll_basic_module.c   (working copy)
> @@ -13,6 +13,8 @@
>   * Copyright (c) 2012  Sandia National Laboratories. All rights reserved.
>   * Copyright (c) 2013  Los Alamos National Security, LLC. All rights
>   * reserved.
> + * Copyright (c) 2014  Research Organization for Information Science
> + * and Technology (RIST). All rights reserved.
>   * $COPYRIGHT$
>   * 
>   * Additional copyrights may follow
> @@ -28,6 +30,7 @@
>  #include "mpi.h"
>  #include "ompi/mca/coll/coll.h"
>  #include "ompi/mca/coll/base/base.h"
> +#include "ompi/mca/topo/topo.h"
>  #include "coll_basic.h"
>  
>  
> @@ -70,6 +73,15 @@
>  } else {
>  size = ompi_comm_size(comm);
>  }
> +if (comm->c_flags & OMPI_COMM_CART) {
> +int cart_size;
> +assert (NULL != comm->c_topo);
> +comm->c_topo->topo.cart.cartdim_get(comm, _size);
> +cart_size *= 2;
> +if (cart_size > size) {
> +size = cart_size;
> +}
> +}
>  basic_module->mccb_num_reqs = size * 2;
>  basic_module->mccb_reqs = (ompi_request_t**) 
>  malloc(sizeof(ompi_request_t *) * 

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-29 Thread Gilles Gouaillardet
Nathan,

why not just make the topology information available at that point as
you described it ?

the attached patch does this, could you please review it ?

Cheers,

Gilles

On 2014/09/26 2:50, Nathan Hjelm wrote:
> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>> I finally managed to track down some issues in mpi4py's test suite
>> using Open MPI 1.8+. The code below should be enough to reproduce the
>> problem. Run it under valgrind to make sense of my following
>> diagnostics.
>>
>> In this code I'm creating a 2D, periodic Cartesian topology out of
>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>> links to itself. So we have size=1 but indegree=outdegree=4. However,
>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>> being allocated to manage communication:
>>
>> if (OMPI_COMM_IS_INTER(comm)) {
>> size = ompi_comm_remote_size(comm);
>> } else {
>> size = ompi_comm_size(comm);
>> }
>> basic_module->mccb_num_reqs = size * 2;
>> basic_module->mccb_reqs = (ompi_request_t**)
>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>
>> I guess you have to also special-case for topologies and allocate
>> indegree+outdegree requests (not sure about this number, just
>> guessing).
>>
> I wish this was possible but the topology information is not available
> at that point. We may be able to change that but I don't see the work
> completing anytime soon. I committed an alternative fix as r32796 and
> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
> produces a SEGV. Let me know if you run into any more issues.
>
>
> -Nathan
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php

Index: ompi/mca/topo/base/topo_base_cart_create.c
===
--- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
+++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
@@ -163,10 +163,18 @@
 return MPI_ERR_INTERN;
 }

+assert(NULL == new_comm->c_topo);
+assert(!(new_comm->c_flags & OMPI_COMM_CART));
+new_comm->c_topo   = topo;
+new_comm->c_topo->mtc.cart = cart;
+new_comm->c_topo->reorder  = reorder;
+new_comm->c_flags |= OMPI_COMM_CART;
 ret = ompi_comm_enable(old_comm, new_comm,
new_rank, num_procs, topo_procs);
 if (OMPI_SUCCESS != ret) {
 /* something wrong happened during setting the communicator */
+new_comm->c_topo = NULL;
+new_comm->c_flags &= ~OMPI_COMM_CART;
 ompi_comm_free (_comm);
 free(topo_procs);
 if(NULL != cart->periods) free(cart->periods);
@@ -176,10 +184,6 @@
 return ret;
 }

-new_comm->c_topo   = topo;
-new_comm->c_topo->mtc.cart = cart;
-new_comm->c_topo->reorder  = reorder;
-new_comm->c_flags |= OMPI_COMM_CART;
 *comm_topo = new_comm;

 if( MPI_UNDEFINED == new_rank ) {
Index: ompi/mca/coll/basic/coll_basic_module.c
===
--- ompi/mca/coll/basic/coll_basic_module.c (revision 32807)
+++ ompi/mca/coll/basic/coll_basic_module.c (working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2012  Sandia National Laboratories. All rights reserved.
  * Copyright (c) 2013  Los Alamos National Security, LLC. All rights
  * reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -28,6 +30,7 @@
 #include "mpi.h"
 #include "ompi/mca/coll/coll.h"
 #include "ompi/mca/coll/base/base.h"
+#include "ompi/mca/topo/topo.h"
 #include "coll_basic.h"


@@ -70,6 +73,15 @@
 } else {
 size = ompi_comm_size(comm);
 }
+if (comm->c_flags & OMPI_COMM_CART) {
+int cart_size;
+assert (NULL != comm->c_topo);
+comm->c_topo->topo.cart.cartdim_get(comm, _size);
+cart_size *= 2;
+if (cart_size > size) {
+size = cart_size;
+}
+}
 basic_module->mccb_num_reqs = size * 2;
 basic_module->mccb_reqs = (ompi_request_t**) 
 malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);


Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-28 Thread Lisandro Dalcin
On 25 September 2014 20:50, Nathan Hjelm  wrote:
> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>> I finally managed to track down some issues in mpi4py's test suite
>> using Open MPI 1.8+. The code below should be enough to reproduce the
>> problem. Run it under valgrind to make sense of my following
>> diagnostics.
>>
>> In this code I'm creating a 2D, periodic Cartesian topology out of
>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>> links to itself. So we have size=1 but indegree=outdegree=4. However,
>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>> being allocated to manage communication:
>>
>> if (OMPI_COMM_IS_INTER(comm)) {
>> size = ompi_comm_remote_size(comm);
>> } else {
>> size = ompi_comm_size(comm);
>> }
>> basic_module->mccb_num_reqs = size * 2;
>> basic_module->mccb_reqs = (ompi_request_t**)
>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>
>> I guess you have to also special-case for topologies and allocate
>> indegree+outdegree requests (not sure about this number, just
>> guessing).
>>
>
> I wish this was possible but the topology information is not available
> at that point. We may be able to change that but I don't see the work
> completing anytime soon. I committed an alternative fix as r32796 and
> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
> produces a SEGV. Let me know if you run into any more issues.
>

Did your fix get in for 1.8.3? I'm still getting the segfault.



-- 
Lisandro Dalcin

Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Numerical Porous Media Center (NumPor)
King Abdullah University of Science and Technology (KAUST)
http://numpor.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 4332
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459


Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-08-26 Thread Nathan Hjelm

Good catch. I will take a look and see how best to fix this.

-Nathan

On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> I finally managed to track down some issues in mpi4py's test suite
> using Open MPI 1.8+. The code below should be enough to reproduce the
> problem. Run it under valgrind to make sense of my following
> diagnostics.
> 
> In this code I'm creating a 2D, periodic Cartesian topology out of
> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
> links to itself. So we have size=1 but indegree=outdegree=4. However,
> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
> being allocated to manage communication:
> 
> if (OMPI_COMM_IS_INTER(comm)) {
> size = ompi_comm_remote_size(comm);
> } else {
> size = ompi_comm_size(comm);
> }
> basic_module->mccb_num_reqs = size * 2;
> basic_module->mccb_reqs = (ompi_request_t**)
> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
> I guess you have to also special-case for topologies and allocate
> indegree+outdegree requests (not sure about this number, just
> guessing).
> 
> 
> #include 
> #include 
> 
> int main(int argc, char *argv[])
> {
>   MPI_Comm comm;
>   int ndims = 2, dims[2] = {1,1}, periods[2] = {1,1};
>   int sendbuf = 7, recvbuf[5] = {0,0,0,0,0};
>   MPI_Init(, );
> 
>   MPI_Cart_create(MPI_COMM_SELF, ndims, dims, periods, 0, );
> 
>   MPI_Neighbor_allgather(, 1, MPI_INT,
>  recvbuf,  1, MPI_INT,
>  comm);
> 
>   {int i; for (i=0;i<5;i++) printf("%d ",recvbuf[i]); printf("\n");}
> 
>   MPI_Finalize();
>   return 0;
> }
> 
> 
> -- 
> Lisandro Dalcin
> 
> Research Scientist
> Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
> Numerical Porous Media Center (NumPor)
> King Abdullah University of Science and Technology (KAUST)
> http://numpor.kaust.edu.sa/
> 
> 4700 King Abdullah University of Science and Technology
> al-Khawarizmi Bldg (Bldg 1), Office # 4332
> Thuwal 23955-6900, Kingdom of Saudi Arabia
> http://www.kaust.edu.sa
> 
> Office Phone: +966 12 808-0459
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15713.php


pgp4ELRAwv940.pgp
Description: PGP signature