Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread Nathan Hjelm
I should have been clearer. This changes how the topo components are
expected to work. With this change they now MUST provide the c_topo to
the communicator before coll_select. The change is good but may be
unexpected for any vendor topo components out there (which I doubt
there are any).

-Nathan

On Wed, Oct 01, 2014 at 12:52:00PM +, Jeff Squyres (jsquyres) wrote:
> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet 
>  wrote:
> 
> > oops, i missed graph and dist graph :-(
> > 
> > the attached patch fixes this
> > /* it was straightforward for graph and a bit more challenging for dist
> > graph */
> > 
> > i am fine for Nathan's patch for v1.8
> > /* v1.8 and trunk have already a different way to handle a memory leak */
> > /* the attached patch is vs v1.8 but it is for review only, i will port
> > it to the trunk */
> > 
> > imho, there is no ABI break with this patch.
> > out of curiosity, what is the ABI break that is/will be introduced with
> > the "real" fix ?
> 
> I haven't looked at the code at all, but Nathan said it changed the topo 
> framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
> Sorry for the confusions.
> 
> > FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
> > test suite in order to test
> > graph and dist_graph
> > 
> > Cheers,
> > 
> > Gilles
> > 
> > 
> > On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
> >> On the call today, we decided that Nathan's v1.8 patch is sufficient for 
> >> the v1.8 series, and the "real" fix will be applied to the trunk and 
> >> carried forward to the v1.9 series (since it introduces an ABI break for 
> >> the topo framework, which we try not to do in the middle of a release 
> >> series).
> >> 
> >> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
> >> 
> >>> Not quite right. There still is no topology information at collective
> >>> selection time for either graph or dist graph.
> >>> 
> >>> -Nathan
> >>> 
> >>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>   Nathan,
>  
>   here is a revision of the previously attached patch, and that supports
>   graph and dist graph.
>  
>   Cheers,
>  
>   Gilles
>  
>   On 2014/09/30 0:05, Nathan Hjelm wrote:
>  
>  An equivalent change would need to be made for graph and dist graph as
>  well. That will take a little more work. Also, I was avoiding changing
>  anything in topo for 1.8.
>  
>  -Nathan
>  
>  On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>  
>    Nathan,
>  
>    why not just make the topology information available at that point as 
>  you
>    described it ?
>  
>    the attached patch does this, could you please review it ?
>  
>    Cheers,
>  
>    Gilles
>  
>    On 2014/09/26 2:50, Nathan Hjelm wrote:
>  
>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>  
>  I finally managed to track down some issues in mpi4py's test suite
>  using Open MPI 1.8+. The code below should be enough to reproduce the
>  problem. Run it under valgrind to make sense of my following
>  diagnostics.
>  
>  In this code I'm creating a 2D, periodic Cartesian topology out of
>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>  being allocated to manage communication:
>  
>  if (OMPI_COMM_IS_INTER(comm)) {
>  size = ompi_comm_remote_size(comm);
>  } else {
>  size = ompi_comm_size(comm);
>  }
>  basic_module->mccb_num_reqs = size * 2;
>  basic_module->mccb_reqs = (ompi_request_t**)
>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>  
>  I guess you have to also special-case for topologies and allocate
>  indegree+outdegree requests (not sure about this number, just
>  guessing).
>  
>  
>  I wish this was possible but the topology information is not available
>  at that point. We may be able to change that but I don't see the work
>  completing anytime soon. I committed an alternative fix as r32796 and
>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>  produces a SEGV. Let me know if you run into any more issues.
>  
>  
>  -Nathan
>  
>  ___
>  devel mailing list
>  de...@open-mpi.org
>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  Link to this post: 
>  http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>  
>  Index: ompi/mca/topo/base/topo_base_cart_create.c
>  ===
>  --- ompi/mca/topo/base

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread Jeff Squyres (jsquyres)
Nathan?

On Oct 1, 2014, at 10:27 AM, George Bosilca  wrote:

> I see no change in the topo interface in any of the patches attached to this 
> thread. Is there any other patch related to this discussion?
> 
>  George.
> 
> 
> 
>> On Oct 1, 2014, at 14:52, Jeff Squyres (jsquyres)  wrote:
>> 
>>> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet 
>>>  wrote:
>>> 
>>> oops, i missed graph and dist graph :-(
>>> 
>>> the attached patch fixes this
>>> /* it was straightforward for graph and a bit more challenging for dist
>>> graph */
>>> 
>>> i am fine for Nathan's patch for v1.8
>>> /* v1.8 and trunk have already a different way to handle a memory leak */
>>> /* the attached patch is vs v1.8 but it is for review only, i will port
>>> it to the trunk */
>>> 
>>> imho, there is no ABI break with this patch.
>>> out of curiosity, what is the ABI break that is/will be introduced with
>>> the "real" fix ?
>> 
>> I haven't looked at the code at all, but Nathan said it changed the topo 
>> framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
>> Sorry for the confusions.
>> 
>>> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
>>> test suite in order to test
>>> graph and dist_graph
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
 On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
 On the call today, we decided that Nathan's v1.8 patch is sufficient for 
 the v1.8 series, and the "real" fix will be applied to the trunk and 
 carried forward to the v1.9 series (since it introduces an ABI break for 
 the topo framework, which we try not to do in the middle of a release 
 series).
 
 On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
 
> Not quite right. There still is no topology information at collective
> selection time for either graph or dist graph.
> 
> -Nathan
> 
>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>> Nathan,
>> 
>> here is a revision of the previously attached patch, and that supports
>> graph and dist graph.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/09/30 0:05, Nathan Hjelm wrote:
>> 
>> An equivalent change would need to be made for graph and dist graph as
>> well. That will take a little more work. Also, I was avoiding changing
>> anything in topo for 1.8.
>> 
>> -Nathan
>> 
>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>> 
>> Nathan,
>> 
>> why not just make the topology information available at that point as you
>> described it ?
>> 
>> the attached patch does this, could you please review it ?
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/09/26 2:50, Nathan Hjelm wrote:
>> 
>> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>> 
>> I finally managed to track down some issues in mpi4py's test suite
>> using Open MPI 1.8+. The code below should be enough to reproduce the
>> problem. Run it under valgrind to make sense of my following
>> diagnostics.
>> 
>> In this code I'm creating a 2D, periodic Cartesian topology out of
>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>> links to itself. So we have size=1 but indegree=outdegree=4. However,
>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>> being allocated to manage communication:
>> 
>>   if (OMPI_COMM_IS_INTER(comm)) {
>>   size = ompi_comm_remote_size(comm);
>>   } else {
>>   size = ompi_comm_size(comm);
>>   }
>>   basic_module->mccb_num_reqs = size * 2;
>>   basic_module->mccb_reqs = (ompi_request_t**)
>>   malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>> 
>> I guess you have to also special-case for topologies and allocate
>> indegree+outdegree requests (not sure about this number, just
>> guessing).
>> 
>> 
>> I wish this was possible but the topology information is not available
>> at that point. We may be able to change that but I don't see the work
>> completing anytime soon. I committed an alternative fix as r32796 and
>> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>> produces a SEGV. Let me know if you run into any more issues.
>> 
>> 
>> -Nathan
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>> 
>> Index: ompi/mca/topo/base/topo_base_cart_create.c
>> ===
>> --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
>> +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread George Bosilca
I see no change in the topo interface in any of the patches attached to this 
thread. Is there any other patch related to this discussion?

  George.



> On Oct 1, 2014, at 14:52, Jeff Squyres (jsquyres)  wrote:
> 
>> On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet 
>>  wrote:
>> 
>> oops, i missed graph and dist graph :-(
>> 
>> the attached patch fixes this
>> /* it was straightforward for graph and a bit more challenging for dist
>> graph */
>> 
>> i am fine for Nathan's patch for v1.8
>> /* v1.8 and trunk have already a different way to handle a memory leak */
>> /* the attached patch is vs v1.8 but it is for review only, i will port
>> it to the trunk */
>> 
>> imho, there is no ABI break with this patch.
>> out of curiosity, what is the ABI break that is/will be introduced with
>> the "real" fix ?
> 
> I haven't looked at the code at all, but Nathan said it changed the topo 
> framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
> Sorry for the confusions.
> 
>> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
>> test suite in order to test
>> graph and dist_graph
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>>> On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
>>> On the call today, we decided that Nathan's v1.8 patch is sufficient for 
>>> the v1.8 series, and the "real" fix will be applied to the trunk and 
>>> carried forward to the v1.9 series (since it introduces an ABI break for 
>>> the topo framework, which we try not to do in the middle of a release 
>>> series).
>>> 
>>> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
>>> 
 Not quite right. There still is no topology information at collective
 selection time for either graph or dist graph.
 
 -Nathan
 
> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
> Nathan,
> 
> here is a revision of the previously attached patch, and that supports
> graph and dist graph.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/09/30 0:05, Nathan Hjelm wrote:
> 
> An equivalent change would need to be made for graph and dist graph as
> well. That will take a little more work. Also, I was avoiding changing
> anything in topo for 1.8.
> 
> -Nathan
> 
> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
> 
>  Nathan,
> 
>  why not just make the topology information available at that point as you
>  described it ?
> 
>  the attached patch does this, could you please review it ?
> 
>  Cheers,
> 
>  Gilles
> 
>  On 2014/09/26 2:50, Nathan Hjelm wrote:
> 
> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> 
> I finally managed to track down some issues in mpi4py's test suite
> using Open MPI 1.8+. The code below should be enough to reproduce the
> problem. Run it under valgrind to make sense of my following
> diagnostics.
> 
> In this code I'm creating a 2D, periodic Cartesian topology out of
> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
> links to itself. So we have size=1 but indegree=outdegree=4. However,
> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
> being allocated to manage communication:
> 
>if (OMPI_COMM_IS_INTER(comm)) {
>size = ompi_comm_remote_size(comm);
>} else {
>size = ompi_comm_size(comm);
>}
>basic_module->mccb_num_reqs = size * 2;
>basic_module->mccb_reqs = (ompi_request_t**)
>malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
> I guess you have to also special-case for topologies and allocate
> indegree+outdegree requests (not sure about this number, just
> guessing).
> 
> 
> I wish this was possible but the topology information is not available
> at that point. We may be able to change that but I don't see the work
> completing anytime soon. I committed an alternative fix as r32796 and
> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
> produces a SEGV. Let me know if you run into any more issues.
> 
> 
> -Nathan
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
> 
> Index: ompi/mca/topo/base/topo_base_cart_create.c
> ===
> --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
> +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
> @@ -163,10 +163,18 @@
>return MPI_ERR_INTERN;
>}
> 
> +assert(NULL == new_comm->c_topo);
> +assert(!(new_comm->c_flags & OMPI_COMM_CART));

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread Jeff Squyres (jsquyres)
On Oct 1, 2014, at 6:48 AM, Gilles Gouaillardet  
wrote:

> oops, i missed graph and dist graph :-(
> 
> the attached patch fixes this
> /* it was straightforward for graph and a bit more challenging for dist
> graph */
> 
> i am fine for Nathan's patch for v1.8
> /* v1.8 and trunk have already a different way to handle a memory leak */
> /* the attached patch is vs v1.8 but it is for review only, i will port
> it to the trunk */
> 
> imho, there is no ABI break with this patch.
> out of curiosity, what is the ABI break that is/will be introduced with
> the "real" fix ?

I haven't looked at the code at all, but Nathan said it changed the topo 
framework interface.  That's the ABI I was referring to -- not the MPI ABI.  
Sorry for the confusions.

> FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
> test suite in order to test
> graph and dist_graph
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
>> On the call today, we decided that Nathan's v1.8 patch is sufficient for the 
>> v1.8 series, and the "real" fix will be applied to the trunk and carried 
>> forward to the v1.9 series (since it introduces an ABI break for the topo 
>> framework, which we try not to do in the middle of a release series).
>> 
>> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
>> 
>>> Not quite right. There still is no topology information at collective
>>> selection time for either graph or dist graph.
>>> 
>>> -Nathan
>>> 
>>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
  Nathan,
 
  here is a revision of the previously attached patch, and that supports
  graph and dist graph.
 
  Cheers,
 
  Gilles
 
  On 2014/09/30 0:05, Nathan Hjelm wrote:
 
 An equivalent change would need to be made for graph and dist graph as
 well. That will take a little more work. Also, I was avoiding changing
 anything in topo for 1.8.
 
 -Nathan
 
 On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
 
   Nathan,
 
   why not just make the topology information available at that point as you
   described it ?
 
   the attached patch does this, could you please review it ?
 
   Cheers,
 
   Gilles
 
   On 2014/09/26 2:50, Nathan Hjelm wrote:
 
 On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
 
 I finally managed to track down some issues in mpi4py's test suite
 using Open MPI 1.8+. The code below should be enough to reproduce the
 problem. Run it under valgrind to make sense of my following
 diagnostics.
 
 In this code I'm creating a 2D, periodic Cartesian topology out of
 COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
 links to itself. So we have size=1 but indegree=outdegree=4. However,
 in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
 being allocated to manage communication:
 
 if (OMPI_COMM_IS_INTER(comm)) {
 size = ompi_comm_remote_size(comm);
 } else {
 size = ompi_comm_size(comm);
 }
 basic_module->mccb_num_reqs = size * 2;
 basic_module->mccb_reqs = (ompi_request_t**)
 malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
 
 I guess you have to also special-case for topologies and allocate
 indegree+outdegree requests (not sure about this number, just
 guessing).
 
 
 I wish this was possible but the topology information is not available
 at that point. We may be able to change that but I don't see the work
 completing anytime soon. I committed an alternative fix as r32796 and
 CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
 produces a SEGV. Let me know if you run into any more issues.
 
 
 -Nathan
 
 ___
 devel mailing list
 de...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
 Link to this post: 
 http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
 
 Index: ompi/mca/topo/base/topo_base_cart_create.c
 ===
 --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
 +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
 @@ -163,10 +163,18 @@
 return MPI_ERR_INTERN;
 }
 
 +assert(NULL == new_comm->c_topo);
 +assert(!(new_comm->c_flags & OMPI_COMM_CART));
 +new_comm->c_topo   = topo;
 +new_comm->c_topo->mtc.cart = cart;
 +new_comm->c_topo->reorder  = reorder;
 +new_comm->c_flags |= OMPI_COMM_CART;
 ret = ompi_comm_enable(old_comm, new_comm,
new_rank, num_procs, topo_procs);
 if (OMPI_SUCCESS != r

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-10-01 Thread Gilles Gouaillardet
oops, i missed graph and dist graph :-(

the attached patch fixes this
/* it was straightforward for graph and a bit more challenging for dist
graph */

i am fine for Nathan's patch for v1.8
/* v1.8 and trunk have already a different way to handle a memory leak */
/* the attached patch is vs v1.8 but it is for review only, i will port
it to the trunk */

imho, there is no ABI break with this patch.
out of curiosity, what is the ABI break that is/will be introduced with
the "real" fix ?

FYI, i also enhanced the collective/neighbor_allgather_self from the ibm
test suite in order to test
graph and dist_graph

Cheers,

Gilles


On 2014/10/01 0:58, Jeff Squyres (jsquyres) wrote:
> On the call today, we decided that Nathan's v1.8 patch is sufficient for the 
> v1.8 series, and the "real" fix will be applied to the trunk and carried 
> forward to the v1.9 series (since it introduces an ABI break for the topo 
> framework, which we try not to do in the middle of a release series).
>
> On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:
>
>> Not quite right. There still is no topology information at collective
>> selection time for either graph or dist graph.
>>
>> -Nathan
>>
>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>>>   Nathan,
>>>
>>>   here is a revision of the previously attached patch, and that supports
>>>   graph and dist graph.
>>>
>>>   Cheers,
>>>
>>>   Gilles
>>>
>>>   On 2014/09/30 0:05, Nathan Hjelm wrote:
>>>
>>> An equivalent change would need to be made for graph and dist graph as
>>> well. That will take a little more work. Also, I was avoiding changing
>>> anything in topo for 1.8.
>>>
>>> -Nathan
>>>
>>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>>>
>>>Nathan,
>>>
>>>why not just make the topology information available at that point as you
>>>described it ?
>>>
>>>the attached patch does this, could you please review it ?
>>>
>>>Cheers,
>>>
>>>Gilles
>>>
>>>On 2014/09/26 2:50, Nathan Hjelm wrote:
>>>
>>>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>>>
>>>  I finally managed to track down some issues in mpi4py's test suite
>>>  using Open MPI 1.8+. The code below should be enough to reproduce the
>>>  problem. Run it under valgrind to make sense of my following
>>>  diagnostics.
>>>
>>>  In this code I'm creating a 2D, periodic Cartesian topology out of
>>>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>>>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>>>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>>>  being allocated to manage communication:
>>>
>>>  if (OMPI_COMM_IS_INTER(comm)) {
>>>  size = ompi_comm_remote_size(comm);
>>>  } else {
>>>  size = ompi_comm_size(comm);
>>>  }
>>>  basic_module->mccb_num_reqs = size * 2;
>>>  basic_module->mccb_reqs = (ompi_request_t**)
>>>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>>
>>>  I guess you have to also special-case for topologies and allocate
>>>  indegree+outdegree requests (not sure about this number, just
>>>  guessing).
>>>
>>>
>>>  I wish this was possible but the topology information is not available
>>>  at that point. We may be able to change that but I don't see the work
>>>  completing anytime soon. I committed an alternative fix as r32796 and
>>>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>>>  produces a SEGV. Let me know if you run into any more issues.
>>>
>>>
>>>  -Nathan
>>>
>>>  ___
>>>  devel mailing list
>>>  de...@open-mpi.org
>>>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>  Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>>>
>>> Index: ompi/mca/topo/base/topo_base_cart_create.c
>>> ===
>>> --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
>>> +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
>>> @@ -163,10 +163,18 @@
>>>  return MPI_ERR_INTERN;
>>>  }
>>>
>>> +assert(NULL == new_comm->c_topo);
>>> +assert(!(new_comm->c_flags & OMPI_COMM_CART));
>>> +new_comm->c_topo   = topo;
>>> +new_comm->c_topo->mtc.cart = cart;
>>> +new_comm->c_topo->reorder  = reorder;
>>> +new_comm->c_flags |= OMPI_COMM_CART;
>>>  ret = ompi_comm_enable(old_comm, new_comm,
>>> new_rank, num_procs, topo_procs);
>>>  if (OMPI_SUCCESS != ret) {
>>>  /* something wrong happened during setting the communicator */
>>> +new_comm->c_topo = NULL;
>>> +new_comm->c_flags &= ~OMPI_COMM_CART;
>>>  ompi_comm_free (&new_comm);
>>>  free(topo_procs);
>>>  if(NULL != cart->periods) free(cart->periods);
>>> @@ -176,10 +184,6 @@
>>>  return ret;
>>>  }
>>>
>

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-30 Thread Jeff Squyres (jsquyres)
On the call today, we decided that Nathan's v1.8 patch is sufficient for the 
v1.8 series, and the "real" fix will be applied to the trunk and carried 
forward to the v1.9 series (since it introduces an ABI break for the topo 
framework, which we try not to do in the middle of a release series).

On Sep 30, 2014, at 10:40 AM, Nathan Hjelm  wrote:

> 
> Not quite right. There still is no topology information at collective
> selection time for either graph or dist graph.
> 
> -Nathan
> 
> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>>   Nathan,
>> 
>>   here is a revision of the previously attached patch, and that supports
>>   graph and dist graph.
>> 
>>   Cheers,
>> 
>>   Gilles
>> 
>>   On 2014/09/30 0:05, Nathan Hjelm wrote:
>> 
>> An equivalent change would need to be made for graph and dist graph as
>> well. That will take a little more work. Also, I was avoiding changing
>> anything in topo for 1.8.
>> 
>> -Nathan
>> 
>> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>> 
>>Nathan,
>> 
>>why not just make the topology information available at that point as you
>>described it ?
>> 
>>the attached patch does this, could you please review it ?
>> 
>>Cheers,
>> 
>>Gilles
>> 
>>On 2014/09/26 2:50, Nathan Hjelm wrote:
>> 
>>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>> 
>>  I finally managed to track down some issues in mpi4py's test suite
>>  using Open MPI 1.8+. The code below should be enough to reproduce the
>>  problem. Run it under valgrind to make sense of my following
>>  diagnostics.
>> 
>>  In this code I'm creating a 2D, periodic Cartesian topology out of
>>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>>  being allocated to manage communication:
>> 
>>  if (OMPI_COMM_IS_INTER(comm)) {
>>  size = ompi_comm_remote_size(comm);
>>  } else {
>>  size = ompi_comm_size(comm);
>>  }
>>  basic_module->mccb_num_reqs = size * 2;
>>  basic_module->mccb_reqs = (ompi_request_t**)
>>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>> 
>>  I guess you have to also special-case for topologies and allocate
>>  indegree+outdegree requests (not sure about this number, just
>>  guessing).
>> 
>> 
>>  I wish this was possible but the topology information is not available
>>  at that point. We may be able to change that but I don't see the work
>>  completing anytime soon. I committed an alternative fix as r32796 and
>>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>>  produces a SEGV. Let me know if you run into any more issues.
>> 
>> 
>>  -Nathan
>> 
>>  ___
>>  devel mailing list
>>  de...@open-mpi.org
>>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>  Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>> 
>> Index: ompi/mca/topo/base/topo_base_cart_create.c
>> ===
>> --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
>> +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
>> @@ -163,10 +163,18 @@
>>  return MPI_ERR_INTERN;
>>  }
>> 
>> +assert(NULL == new_comm->c_topo);
>> +assert(!(new_comm->c_flags & OMPI_COMM_CART));
>> +new_comm->c_topo   = topo;
>> +new_comm->c_topo->mtc.cart = cart;
>> +new_comm->c_topo->reorder  = reorder;
>> +new_comm->c_flags |= OMPI_COMM_CART;
>>  ret = ompi_comm_enable(old_comm, new_comm,
>> new_rank, num_procs, topo_procs);
>>  if (OMPI_SUCCESS != ret) {
>>  /* something wrong happened during setting the communicator */
>> +new_comm->c_topo = NULL;
>> +new_comm->c_flags &= ~OMPI_COMM_CART;
>>  ompi_comm_free (&new_comm);
>>  free(topo_procs);
>>  if(NULL != cart->periods) free(cart->periods);
>> @@ -176,10 +184,6 @@
>>  return ret;
>>  }
>> 
>> -new_comm->c_topo   = topo;
>> -new_comm->c_topo->mtc.cart = cart;
>> -new_comm->c_topo->reorder  = reorder;
>> -new_comm->c_flags |= OMPI_COMM_CART;
>>  *comm_topo = new_comm;
>> 
>>  if( MPI_UNDEFINED == new_rank ) {
>> Index: ompi/mca/coll/basic/coll_basic_module.c
>> ===
>> --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807)
>> +++ ompi/mca/coll/basic/coll_basic_module.c (working copy)
>> @@ -13,6 +13,8 @@
>>   * Copyright (c) 2012  Sandia National Laboratories. All rights 
>> reserved.
>>   * Copyright (c) 2013  Los Alamos National Security, LLC. All rights
>>   * reserved.
>> + * Copyright (c) 2014  R

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-30 Thread Nathan Hjelm

Not quite right. There still is no topology information at collective
selection time for either graph or dist graph.

-Nathan

On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>Nathan,
> 
>here is a revision of the previously attached patch, and that supports
>graph and dist graph.
> 
>Cheers,
> 
>Gilles
> 
>On 2014/09/30 0:05, Nathan Hjelm wrote:
> 
>  An equivalent change would need to be made for graph and dist graph as
>  well. That will take a little more work. Also, I was avoiding changing
>  anything in topo for 1.8.
> 
>  -Nathan
> 
>  On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
> 
> Nathan,
> 
> why not just make the topology information available at that point as you
> described it ?
> 
> the attached patch does this, could you please review it ?
> 
> Cheers,
> 
> Gilles
> 
> On 2014/09/26 2:50, Nathan Hjelm wrote:
> 
>   On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> 
>   I finally managed to track down some issues in mpi4py's test suite
>   using Open MPI 1.8+. The code below should be enough to reproduce the
>   problem. Run it under valgrind to make sense of my following
>   diagnostics.
> 
>   In this code I'm creating a 2D, periodic Cartesian topology out of
>   COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>   links to itself. So we have size=1 but indegree=outdegree=4. However,
>   in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>   being allocated to manage communication:
> 
>   if (OMPI_COMM_IS_INTER(comm)) {
>   size = ompi_comm_remote_size(comm);
>   } else {
>   size = ompi_comm_size(comm);
>   }
>   basic_module->mccb_num_reqs = size * 2;
>   basic_module->mccb_reqs = (ompi_request_t**)
>   malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
>   I guess you have to also special-case for topologies and allocate
>   indegree+outdegree requests (not sure about this number, just
>   guessing).
> 
> 
>   I wish this was possible but the topology information is not available
>   at that point. We may be able to change that but I don't see the work
>   completing anytime soon. I committed an alternative fix as r32796 and
>   CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>   produces a SEGV. Let me know if you run into any more issues.
> 
> 
>   -Nathan
> 
>   ___
>   devel mailing list
>   de...@open-mpi.org
>   Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>   Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
> 
>  Index: ompi/mca/topo/base/topo_base_cart_create.c
>  ===
>  --- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
>  +++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
>  @@ -163,10 +163,18 @@
>   return MPI_ERR_INTERN;
>   }
>  
>  +assert(NULL == new_comm->c_topo);
>  +assert(!(new_comm->c_flags & OMPI_COMM_CART));
>  +new_comm->c_topo   = topo;
>  +new_comm->c_topo->mtc.cart = cart;
>  +new_comm->c_topo->reorder  = reorder;
>  +new_comm->c_flags |= OMPI_COMM_CART;
>   ret = ompi_comm_enable(old_comm, new_comm,
>  new_rank, num_procs, topo_procs);
>   if (OMPI_SUCCESS != ret) {
>   /* something wrong happened during setting the communicator */
>  +new_comm->c_topo = NULL;
>  +new_comm->c_flags &= ~OMPI_COMM_CART;
>   ompi_comm_free (&new_comm);
>   free(topo_procs);
>   if(NULL != cart->periods) free(cart->periods);
>  @@ -176,10 +184,6 @@
>   return ret;
>   }
>  
>  -new_comm->c_topo   = topo;
>  -new_comm->c_topo->mtc.cart = cart;
>  -new_comm->c_topo->reorder  = reorder;
>  -new_comm->c_flags |= OMPI_COMM_CART;
>   *comm_topo = new_comm;
>  
>   if( MPI_UNDEFINED == new_rank ) {
>  Index: ompi/mca/coll/basic/coll_basic_module.c
>  ===
>  --- ompi/mca/coll/basic/coll_basic_module.c (revision 32807)
>  +++ ompi/mca/coll/basic/coll_basic_module.c (working copy)
>  @@ -13,6 +13,8 @@
>* Copyright (c) 2012  Sandia National Laboratories. All rights 
> reserved.
>* Copyright (c) 2013  Los Alamos National Security, LLC. All rights
>* reserved.
>  + * Copyright (c) 2014  Research Organization for Information Science
>  + * and Technology (RIST). All rights reserved.
>* $COPYRIGHT$
>*
>* Additional copyrights may follow
>  @@ -28,6 +30,7 @@
>   #include "mpi.h"
>   #include "ompi/mca/coll/coll.h"
>   #include "ompi/mca/coll/base/base.h"
>  +#include "ompi/mca/topo/topo.h"
>   #include "coll_basic.h"
>  
>  
>  @@ -70,6 +

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-30 Thread Gilles Gouaillardet
Nathan,

here is a revision of the previously attached patch, and that supports
graph and dist graph.

Cheers,

Gilles

On 2014/09/30 0:05, Nathan Hjelm wrote:
> An equivalent change would need to be made for graph and dist graph as
> well. That will take a little more work. Also, I was avoiding changing
> anything in topo for 1.8.
>
> -Nathan
>
> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>>Nathan,
>>
>>why not just make the topology information available at that point as you
>>described it ?
>>
>>the attached patch does this, could you please review it ?
>>
>>Cheers,
>>
>>Gilles
>>
>>On 2014/09/26 2:50, Nathan Hjelm wrote:
>>
>>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>>
>>  I finally managed to track down some issues in mpi4py's test suite
>>  using Open MPI 1.8+. The code below should be enough to reproduce the
>>  problem. Run it under valgrind to make sense of my following
>>  diagnostics.
>>
>>  In this code I'm creating a 2D, periodic Cartesian topology out of
>>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>>  being allocated to manage communication:
>>
>>  if (OMPI_COMM_IS_INTER(comm)) {
>>  size = ompi_comm_remote_size(comm);
>>  } else {
>>  size = ompi_comm_size(comm);
>>  }
>>  basic_module->mccb_num_reqs = size * 2;
>>  basic_module->mccb_reqs = (ompi_request_t**)
>>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>
>>  I guess you have to also special-case for topologies and allocate
>>  indegree+outdegree requests (not sure about this number, just
>>  guessing).
>>
>>
>>  I wish this was possible but the topology information is not available
>>  at that point. We may be able to change that but I don't see the work
>>  completing anytime soon. I committed an alternative fix as r32796 and
>>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>>  produces a SEGV. Let me know if you run into any more issues.
>>
>>
>>  -Nathan
>>
>>  ___
>>  devel mailing list
>>  de...@open-mpi.org
>>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>  Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php
>> Index: ompi/mca/topo/base/topo_base_cart_create.c
>> ===
>> --- ompi/mca/topo/base/topo_base_cart_create.c   (revision 32807)
>> +++ ompi/mca/topo/base/topo_base_cart_create.c   (working copy)
>> @@ -163,10 +163,18 @@
>>  return MPI_ERR_INTERN;
>>  }
>>  
>> +assert(NULL == new_comm->c_topo);
>> +assert(!(new_comm->c_flags & OMPI_COMM_CART));
>> +new_comm->c_topo   = topo;
>> +new_comm->c_topo->mtc.cart = cart;
>> +new_comm->c_topo->reorder  = reorder;
>> +new_comm->c_flags |= OMPI_COMM_CART;
>>  ret = ompi_comm_enable(old_comm, new_comm,
>> new_rank, num_procs, topo_procs);
>>  if (OMPI_SUCCESS != ret) {
>>  /* something wrong happened during setting the communicator */
>> +new_comm->c_topo = NULL;
>> +new_comm->c_flags &= ~OMPI_COMM_CART;
>>  ompi_comm_free (&new_comm);
>>  free(topo_procs);
>>  if(NULL != cart->periods) free(cart->periods);
>> @@ -176,10 +184,6 @@
>>  return ret;
>>  }
>>  
>> -new_comm->c_topo   = topo;
>> -new_comm->c_topo->mtc.cart = cart;
>> -new_comm->c_topo->reorder  = reorder;
>> -new_comm->c_flags |= OMPI_COMM_CART;
>>  *comm_topo = new_comm;
>>  
>>  if( MPI_UNDEFINED == new_rank ) {
>> Index: ompi/mca/coll/basic/coll_basic_module.c
>> ===
>> --- ompi/mca/coll/basic/coll_basic_module.c  (revision 32807)
>> +++ ompi/mca/coll/basic/coll_basic_module.c  (working copy)
>> @@ -13,6 +13,8 @@
>>   * Copyright (c) 2012  Sandia National Laboratories. All rights 
>> reserved.
>>   * Copyright (c) 2013  Los Alamos National Security, LLC. All rights
>>   * reserved.
>> + * Copyright (c) 2014  Research Organization for Information Science
>> + * and Technology (RIST). All rights reserved.
>>   * $COPYRIGHT$
>>   * 
>>   * Additional copyrights may follow
>> @@ -28,6 +30,7 @@
>>  #include "mpi.h"
>>  #include "ompi/mca/coll/coll.h"
>>  #include "ompi/mca/coll/base/base.h"
>> +#include "ompi/mca/topo/topo.h"
>>  #include "coll_basic.h"
>>  
>>  
>> @@ -70,6 +73,15 @@
>>  } else {
>>  size = ompi_comm_size(comm);
>>  }
>> +if (comm->c_flags & OMPI_COMM_CART) {
>> +int cart_size;
>> +assert (NULL != comm->c_topo);
>> +comm->c_topo->topo.cart.cartdim_ge

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-29 Thread Nathan Hjelm
An equivalent change would need to be made for graph and dist graph as
well. That will take a little more work. Also, I was avoiding changing
anything in topo for 1.8.

-Nathan

On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>Nathan,
> 
>why not just make the topology information available at that point as you
>described it ?
> 
>the attached patch does this, could you please review it ?
> 
>Cheers,
> 
>Gilles
> 
>On 2014/09/26 2:50, Nathan Hjelm wrote:
> 
>  On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> 
>  I finally managed to track down some issues in mpi4py's test suite
>  using Open MPI 1.8+. The code below should be enough to reproduce the
>  problem. Run it under valgrind to make sense of my following
>  diagnostics.
> 
>  In this code I'm creating a 2D, periodic Cartesian topology out of
>  COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>  links to itself. So we have size=1 but indegree=outdegree=4. However,
>  in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>  being allocated to manage communication:
> 
>  if (OMPI_COMM_IS_INTER(comm)) {
>  size = ompi_comm_remote_size(comm);
>  } else {
>  size = ompi_comm_size(comm);
>  }
>  basic_module->mccb_num_reqs = size * 2;
>  basic_module->mccb_reqs = (ompi_request_t**)
>  malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
>  I guess you have to also special-case for topologies and allocate
>  indegree+outdegree requests (not sure about this number, just
>  guessing).
> 
> 
>  I wish this was possible but the topology information is not available
>  at that point. We may be able to change that but I don't see the work
>  completing anytime soon. I committed an alternative fix as r32796 and
>  CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
>  produces a SEGV. Let me know if you run into any more issues.
> 
> 
>  -Nathan
> 
>  ___
>  devel mailing list
>  de...@open-mpi.org
>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php

> Index: ompi/mca/topo/base/topo_base_cart_create.c
> ===
> --- ompi/mca/topo/base/topo_base_cart_create.c(revision 32807)
> +++ ompi/mca/topo/base/topo_base_cart_create.c(working copy)
> @@ -163,10 +163,18 @@
>  return MPI_ERR_INTERN;
>  }
>  
> +assert(NULL == new_comm->c_topo);
> +assert(!(new_comm->c_flags & OMPI_COMM_CART));
> +new_comm->c_topo   = topo;
> +new_comm->c_topo->mtc.cart = cart;
> +new_comm->c_topo->reorder  = reorder;
> +new_comm->c_flags |= OMPI_COMM_CART;
>  ret = ompi_comm_enable(old_comm, new_comm,
> new_rank, num_procs, topo_procs);
>  if (OMPI_SUCCESS != ret) {
>  /* something wrong happened during setting the communicator */
> +new_comm->c_topo = NULL;
> +new_comm->c_flags &= ~OMPI_COMM_CART;
>  ompi_comm_free (&new_comm);
>  free(topo_procs);
>  if(NULL != cart->periods) free(cart->periods);
> @@ -176,10 +184,6 @@
>  return ret;
>  }
>  
> -new_comm->c_topo   = topo;
> -new_comm->c_topo->mtc.cart = cart;
> -new_comm->c_topo->reorder  = reorder;
> -new_comm->c_flags |= OMPI_COMM_CART;
>  *comm_topo = new_comm;
>  
>  if( MPI_UNDEFINED == new_rank ) {
> Index: ompi/mca/coll/basic/coll_basic_module.c
> ===
> --- ompi/mca/coll/basic/coll_basic_module.c   (revision 32807)
> +++ ompi/mca/coll/basic/coll_basic_module.c   (working copy)
> @@ -13,6 +13,8 @@
>   * Copyright (c) 2012  Sandia National Laboratories. All rights reserved.
>   * Copyright (c) 2013  Los Alamos National Security, LLC. All rights
>   * reserved.
> + * Copyright (c) 2014  Research Organization for Information Science
> + * and Technology (RIST). All rights reserved.
>   * $COPYRIGHT$
>   * 
>   * Additional copyrights may follow
> @@ -28,6 +30,7 @@
>  #include "mpi.h"
>  #include "ompi/mca/coll/coll.h"
>  #include "ompi/mca/coll/base/base.h"
> +#include "ompi/mca/topo/topo.h"
>  #include "coll_basic.h"
>  
>  
> @@ -70,6 +73,15 @@
>  } else {
>  size = ompi_comm_size(comm);
>  }
> +if (comm->c_flags & OMPI_COMM_CART) {
> +int cart_size;
> +assert (NULL != comm->c_topo);
> +comm->c_topo->topo.cart.cartdim_get(comm, &cart_size);
> +cart_size *= 2;
> +if (cart_size > size) {
> +size = cart_size;
> +}
> +}
>  basic_module->mccb_num_reqs = size * 2;
>  basic_module->mccb_reqs = (ompi_request_t**) 
>  malloc(sizeof(ompi_reques

Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-29 Thread Gilles Gouaillardet
Nathan,

why not just make the topology information available at that point as
you described it ?

the attached patch does this, could you please review it ?

Cheers,

Gilles

On 2014/09/26 2:50, Nathan Hjelm wrote:
> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>> I finally managed to track down some issues in mpi4py's test suite
>> using Open MPI 1.8+. The code below should be enough to reproduce the
>> problem. Run it under valgrind to make sense of my following
>> diagnostics.
>>
>> In this code I'm creating a 2D, periodic Cartesian topology out of
>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>> links to itself. So we have size=1 but indegree=outdegree=4. However,
>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>> being allocated to manage communication:
>>
>> if (OMPI_COMM_IS_INTER(comm)) {
>> size = ompi_comm_remote_size(comm);
>> } else {
>> size = ompi_comm_size(comm);
>> }
>> basic_module->mccb_num_reqs = size * 2;
>> basic_module->mccb_reqs = (ompi_request_t**)
>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>
>> I guess you have to also special-case for topologies and allocate
>> indegree+outdegree requests (not sure about this number, just
>> guessing).
>>
> I wish this was possible but the topology information is not available
> at that point. We may be able to change that but I don't see the work
> completing anytime soon. I committed an alternative fix as r32796 and
> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
> produces a SEGV. Let me know if you run into any more issues.
>
>
> -Nathan
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15915.php

Index: ompi/mca/topo/base/topo_base_cart_create.c
===
--- ompi/mca/topo/base/topo_base_cart_create.c  (revision 32807)
+++ ompi/mca/topo/base/topo_base_cart_create.c  (working copy)
@@ -163,10 +163,18 @@
 return MPI_ERR_INTERN;
 }

+assert(NULL == new_comm->c_topo);
+assert(!(new_comm->c_flags & OMPI_COMM_CART));
+new_comm->c_topo   = topo;
+new_comm->c_topo->mtc.cart = cart;
+new_comm->c_topo->reorder  = reorder;
+new_comm->c_flags |= OMPI_COMM_CART;
 ret = ompi_comm_enable(old_comm, new_comm,
new_rank, num_procs, topo_procs);
 if (OMPI_SUCCESS != ret) {
 /* something wrong happened during setting the communicator */
+new_comm->c_topo = NULL;
+new_comm->c_flags &= ~OMPI_COMM_CART;
 ompi_comm_free (&new_comm);
 free(topo_procs);
 if(NULL != cart->periods) free(cart->periods);
@@ -176,10 +184,6 @@
 return ret;
 }

-new_comm->c_topo   = topo;
-new_comm->c_topo->mtc.cart = cart;
-new_comm->c_topo->reorder  = reorder;
-new_comm->c_flags |= OMPI_COMM_CART;
 *comm_topo = new_comm;

 if( MPI_UNDEFINED == new_rank ) {
Index: ompi/mca/coll/basic/coll_basic_module.c
===
--- ompi/mca/coll/basic/coll_basic_module.c (revision 32807)
+++ ompi/mca/coll/basic/coll_basic_module.c (working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2012  Sandia National Laboratories. All rights reserved.
  * Copyright (c) 2013  Los Alamos National Security, LLC. All rights
  * reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -28,6 +30,7 @@
 #include "mpi.h"
 #include "ompi/mca/coll/coll.h"
 #include "ompi/mca/coll/base/base.h"
+#include "ompi/mca/topo/topo.h"
 #include "coll_basic.h"


@@ -70,6 +73,15 @@
 } else {
 size = ompi_comm_size(comm);
 }
+if (comm->c_flags & OMPI_COMM_CART) {
+int cart_size;
+assert (NULL != comm->c_topo);
+comm->c_topo->topo.cart.cartdim_get(comm, &cart_size);
+cart_size *= 2;
+if (cart_size > size) {
+size = cart_size;
+}
+}
 basic_module->mccb_num_reqs = size * 2;
 basic_module->mccb_reqs = (ompi_request_t**) 
 malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);


Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-28 Thread Lisandro Dalcin
On 25 September 2014 20:50, Nathan Hjelm  wrote:
> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>> I finally managed to track down some issues in mpi4py's test suite
>> using Open MPI 1.8+. The code below should be enough to reproduce the
>> problem. Run it under valgrind to make sense of my following
>> diagnostics.
>>
>> In this code I'm creating a 2D, periodic Cartesian topology out of
>> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
>> links to itself. So we have size=1 but indegree=outdegree=4. However,
>> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
>> being allocated to manage communication:
>>
>> if (OMPI_COMM_IS_INTER(comm)) {
>> size = ompi_comm_remote_size(comm);
>> } else {
>> size = ompi_comm_size(comm);
>> }
>> basic_module->mccb_num_reqs = size * 2;
>> basic_module->mccb_reqs = (ompi_request_t**)
>> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
>>
>> I guess you have to also special-case for topologies and allocate
>> indegree+outdegree requests (not sure about this number, just
>> guessing).
>>
>
> I wish this was possible but the topology information is not available
> at that point. We may be able to change that but I don't see the work
> completing anytime soon. I committed an alternative fix as r32796 and
> CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
> produces a SEGV. Let me know if you run into any more issues.
>

Did your fix get in for 1.8.3? I'm still getting the segfault.



-- 
Lisandro Dalcin

Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Numerical Porous Media Center (NumPor)
King Abdullah University of Science and Technology (KAUST)
http://numpor.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 4332
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459


Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-09-25 Thread Nathan Hjelm
On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> I finally managed to track down some issues in mpi4py's test suite
> using Open MPI 1.8+. The code below should be enough to reproduce the
> problem. Run it under valgrind to make sense of my following
> diagnostics.
> 
> In this code I'm creating a 2D, periodic Cartesian topology out of
> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
> links to itself. So we have size=1 but indegree=outdegree=4. However,
> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
> being allocated to manage communication:
> 
> if (OMPI_COMM_IS_INTER(comm)) {
> size = ompi_comm_remote_size(comm);
> } else {
> size = ompi_comm_size(comm);
> }
> basic_module->mccb_num_reqs = size * 2;
> basic_module->mccb_reqs = (ompi_request_t**)
> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
> I guess you have to also special-case for topologies and allocate
> indegree+outdegree requests (not sure about this number, just
> guessing).
>

I wish this was possible but the topology information is not available
at that point. We may be able to change that but I don't see the work
completing anytime soon. I committed an alternative fix as r32796 and
CMR'd it to 1.8.3. I can confirm that the attached reproducer no longer
produces a SEGV. Let me know if you run into any more issues.


-Nathan


pgpiboDbxhbSj.pgp
Description: PGP signature


Re: [OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-08-26 Thread Nathan Hjelm

Good catch. I will take a look and see how best to fix this.

-Nathan

On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
> I finally managed to track down some issues in mpi4py's test suite
> using Open MPI 1.8+. The code below should be enough to reproduce the
> problem. Run it under valgrind to make sense of my following
> diagnostics.
> 
> In this code I'm creating a 2D, periodic Cartesian topology out of
> COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
> links to itself. So we have size=1 but indegree=outdegree=4. However,
> in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
> being allocated to manage communication:
> 
> if (OMPI_COMM_IS_INTER(comm)) {
> size = ompi_comm_remote_size(comm);
> } else {
> size = ompi_comm_size(comm);
> }
> basic_module->mccb_num_reqs = size * 2;
> basic_module->mccb_reqs = (ompi_request_t**)
> malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);
> 
> I guess you have to also special-case for topologies and allocate
> indegree+outdegree requests (not sure about this number, just
> guessing).
> 
> 
> #include 
> #include 
> 
> int main(int argc, char *argv[])
> {
>   MPI_Comm comm;
>   int ndims = 2, dims[2] = {1,1}, periods[2] = {1,1};
>   int sendbuf = 7, recvbuf[5] = {0,0,0,0,0};
>   MPI_Init(&argc, &argv);
> 
>   MPI_Cart_create(MPI_COMM_SELF, ndims, dims, periods, 0, &comm);
> 
>   MPI_Neighbor_allgather(&sendbuf, 1, MPI_INT,
>  recvbuf,  1, MPI_INT,
>  comm);
> 
>   {int i; for (i=0;i<5;i++) printf("%d ",recvbuf[i]); printf("\n");}
> 
>   MPI_Finalize();
>   return 0;
> }
> 
> 
> -- 
> Lisandro Dalcin
> 
> Research Scientist
> Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
> Numerical Porous Media Center (NumPor)
> King Abdullah University of Science and Technology (KAUST)
> http://numpor.kaust.edu.sa/
> 
> 4700 King Abdullah University of Science and Technology
> al-Khawarizmi Bldg (Bldg 1), Office # 4332
> Thuwal 23955-6900, Kingdom of Saudi Arabia
> http://www.kaust.edu.sa
> 
> Office Phone: +966 12 808-0459
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15713.php


pgp4ELRAwv940.pgp
Description: PGP signature


[OMPI devel] Neighbor collectives with periodic Cartesian topologies of size one

2014-08-26 Thread Lisandro Dalcin
I finally managed to track down some issues in mpi4py's test suite
using Open MPI 1.8+. The code below should be enough to reproduce the
problem. Run it under valgrind to make sense of my following
diagnostics.

In this code I'm creating a 2D, periodic Cartesian topology out of
COMM_SELF. In this case, the process in COMM_SELF has 4 logical in/out
links to itself. So we have size=1 but indegree=outdegree=4. However,
in ompi/mca/coll/basic/coll_basic_module.c, "size * 2" request are
being allocated to manage communication:

if (OMPI_COMM_IS_INTER(comm)) {
size = ompi_comm_remote_size(comm);
} else {
size = ompi_comm_size(comm);
}
basic_module->mccb_num_reqs = size * 2;
basic_module->mccb_reqs = (ompi_request_t**)
malloc(sizeof(ompi_request_t *) * basic_module->mccb_num_reqs);

I guess you have to also special-case for topologies and allocate
indegree+outdegree requests (not sure about this number, just
guessing).


#include 
#include 

int main(int argc, char *argv[])
{
  MPI_Comm comm;
  int ndims = 2, dims[2] = {1,1}, periods[2] = {1,1};
  int sendbuf = 7, recvbuf[5] = {0,0,0,0,0};
  MPI_Init(&argc, &argv);

  MPI_Cart_create(MPI_COMM_SELF, ndims, dims, periods, 0, &comm);

  MPI_Neighbor_allgather(&sendbuf, 1, MPI_INT,
 recvbuf,  1, MPI_INT,
 comm);

  {int i; for (i=0;i<5;i++) printf("%d ",recvbuf[i]); printf("\n");}

  MPI_Finalize();
  return 0;
}


-- 
Lisandro Dalcin

Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Numerical Porous Media Center (NumPor)
King Abdullah University of Science and Technology (KAUST)
http://numpor.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 4332
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459