Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Brad Benton
r20275 looks good.  I suggest that we CMR that into 1.3 and get rc6 rolled
and tested. (Actually, Jeff just did the CMR... so off to rc6.)
--brad


On Wed, Jan 14, 2009 at 1:16 PM, Edgar Gabriel  wrote:

> So I am not entirely sure why the bug only happened on the trunk; it could
> in theory also appear on v1.3 (is there a difference in how pointer_arrays
> are handled between the two versions?)
>
> Anyway, it passes now on both with changeset 20275. We should probably move
> that over to 1.3 as well, whether for 1.3.0 or 1.3.1 I leave that up to
> others to decide...
>
> Thanks
> Edgar
>
>
> Edgar Gabriel wrote:
>
>> I'm already debugging it. The good news is that it only seems to appear
>> on the trunk; with 1.3 (after copying the new tuned module over), all the
>> tests pass.
>>
>> Now if somebody can tell me a trick for telling mpirun not to kill the
>> debugger under my feet, then I could even see where the problem occurs :-)
>>
>> Thanks
>> Edgar
>>
>> George Bosilca wrote:
>>
>>> All these errors are in MPI_Finalize; it should not be that hard to
>>> find. I'll take a look later this afternoon.
>>>
>>>  george.
>>>
>>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>>
>>>  Unfortunately, although this fixed some problems when enabling hierarch
 coll,
 there is still a segfault in two of IU's tests that only shows up when
 we set
 -mca coll_hierarch_priority 100

 See this MTT summary to see how the failures improved on the trunk,
 but that there are still two that segfault even at 1.4a1r20267:
 http://www.open-mpi.org/mtt/index.php?do_redir=923

 This link just has the remaining failures:
 http://www.open-mpi.org/mtt/index.php?do_redir=922

 So, I'll vote for applying the CMR for 1.3 since it clearly improved
 things,
 but there is still more to be done to get coll_hierarch ready for
 regular
 use.

 On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
 wrote:

> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
> george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>  Let's debate tomorrow when people are around, but first you have to
>> file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>> Unfortunately, this pinpoints the fact that we didn't test the
>>> collective module mixing enough. I went over the tuned collective
>>> functions and changed all instances to use the correct module
>>> information. It is now on the trunk, revision 20267. Simultaneously,
>>> I checked that all other collective components do the right thing ...
>>> and I have to admit tuned was the only faulty one.
>>>
>>> This is clearly a bug in tuned, and correcting it will allow people
>>> to use hierarch. In its current incarnation, 1.3 will mostly/always
>>> segfault when hierarch is active. I would prefer not to put a broken
>>> toy out there. How about pushing r20267 into 1.3?
>>>
>>> george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
>>>  Thanks for digging into this.  Can you file a bug?  Let's mark it
 for
 v1.3.1.

 I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
 and
 since hierarch isn't currently selected by default (you must
 specifically
 elevate hierarch's priority to get it to run), there's no danger
 that users
 will run into this problem in default runs.

 But clearly the problem needs to be fixed, and therefore we need a
 bug
 to track it.



 On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

> I just debugged the Reduce_scatter bug mentioned previously. The bug
> is unfortunately not in hierarch, but in tuned.
>
> Here is the code snippet causing the problems:
>
> int reduce_scatter (..., mca_coll_base_module_t *module)
> {
> ...
> err = comm->c_coll.coll_reduce (..., module)
> ...
> }
>
>
> but should be
> {
> ...
> err = comm->c_coll.coll_reduce (...,
> comm->c_coll.coll_reduce_module);
> ...
> }
>
> The problem right now is that, when using hierarch, only a subset of
> the functions are set, e.g. reduce, allreduce, bcast and barrier.
> Thus, reduce_scatter comes from tuned in most scenarios, and calls the
> subsequent functions with the wrong module. Hierarch of course does
> not like that :-)
>
> Anyway, a quick glance through the tuned code reveals a significant
> number of instances where this appears (reduce_scatter, allreduce,
> allgather, allgatherv). Basic, hierarch and inter seem to do that
> mostly correctly.

Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Edgar Gabriel
So I am not entirely sure why the bug only happened on the trunk; it
could in theory also appear on v1.3 (is there a difference in how
pointer_arrays are handled between the two versions?)


Anyway, it passes now on both with changeset 20275. We should probably 
move that over to 1.3 as well, whether for 1.3.0 or 1.3.1 I leave that 
up to others to decide...


Thanks
Edgar

Edgar Gabriel wrote:
I'm already debugging it. The good news is that it only seems to appear
on the trunk; with 1.3 (after copying the new tuned module over), all
the tests pass.


Now if somebody can tell me a trick for telling mpirun not to kill the
debugger under my feet, then I could even see where the problem occurs :-)


Thanks
Edgar

George Bosilca wrote:
All these errors are in MPI_Finalize; it should not be that hard to
find. I'll take a look later this afternoon.


  george.

On Jan 14, 2009, at 06:41 , Tim Mattox wrote:

Unfortunately, although this fixed some problems when enabling hierarch
coll, there is still a segfault in two of IU's tests that only shows up
when we set
-mca coll_hierarch_priority 100

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved
things, but there is still more to be done to get coll_hierarch ready
for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
 wrote:

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to
file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:


Unfortunately, this pinpoints the fact that we didn't test the
collective module mixing enough. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. Simultaneously,
I checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.

This is clearly a bug in tuned, and correcting it will allow people
to use hierarch. In its current incarnation, 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to put a broken
toy out there. How about pushing r20267 into 1.3?

george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark it for
v1.3.1.

I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
since hierarch isn't currently selected by default (you must
specifically elevate hierarch's priority to get it to run), there's no
danger that users will run into this problem in default runs.

But clearly the problem needs to be fixed, and therefore we need a bug
to track it.



On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that, when using hierarch, only a subset of
the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter comes from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)

Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___

Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Brad Benton
So, if it looks okay on 1.3... then there should not be anything holding up
the release, right?  Otherwise, George, we need to decide whether this is a
blocker, or whether we go ahead and release with this as a known issue and
schedule the fix for 1.3.1.  My vote is to go ahead and release, but if
you (or others) think otherwise, let's talk about how best to move forward.
--brad


On Wed, Jan 14, 2009 at 12:04 PM, Edgar Gabriel  wrote:

> I'm already debugging it. The good news is that it only seems to appear
> on the trunk; with 1.3 (after copying the new tuned module over), all the
> tests pass.
>
> Now if somebody can tell me a trick for telling mpirun not to kill the
> debugger under my feet, then I could even see where the problem occurs :-)
>
> Thanks
> Edgar
>
>
> George Bosilca wrote:
>
>> All these errors are in MPI_Finalize; it should not be that hard to
>> find. I'll take a look later this afternoon.
>>
>>  george.
>>
>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>
>>  Unfortunately, although this fixed some problems when enabling hierarch
>>> coll,
>>> there is still a segfault in two of IU's tests that only shows up when we
>>> set
>>> -mca coll_hierarch_priority 100
>>>
>>> See this MTT summary to see how the failures improved on the trunk,
>>> but that there are still two that segfault even at 1.4a1r20267:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=923
>>>
>>> This link just has the remaining failures:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=922
>>>
>>> So, I'll vote for applying the CMR for 1.3 since it clearly improved
>>> things,
>>> but there is still more to be done to get coll_hierarch ready for regular
>>> use.
>>>
>>> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
>>> wrote:
>>>
 Here we go by the book :)

 https://svn.open-mpi.org/trac/ompi/ticket/1749

 george.

 On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

  Let's debate tomorrow when people are around, but first you have to
> file a
> CMR... :-)
>
> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>
>> Unfortunately, this pinpoints the fact that we didn't test the
>> collective module mixing enough. I went over the tuned collective
>> functions and changed all instances to use the correct module
>> information. It is now on the trunk, revision 20267. Simultaneously,
>> I checked that all other collective components do the right thing ...
>> and I have to admit tuned was the only faulty one.
>>
>> This is clearly a bug in tuned, and correcting it will allow people
>> to use hierarch. In its current incarnation, 1.3 will mostly/always
>> segfault when hierarch is active. I would prefer not to put a broken
>> toy out there. How about pushing r20267 into 1.3?
>>
>> george.
>>
>>
>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>
>>  Thanks for digging into this.  Can you file a bug?  Let's mark it for
>>> v1.3.1.
>>>
>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>>> and
>>> since hierarch isn't currently selected by default (you must
>>> specifically
>>> elevate hierarch's priority to get it to run), there's no danger that
>>> users
>>> will run into this problem in default runs.
>>>
>>> But clearly the problem needs to be fixed, and therefore we need a
>>> bug
>>> to track it.
>>>
>>>
>>>
>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>
>>> I just debugged the Reduce_scatter bug mentioned previously. The bug
 is unfortunately not in hierarch, but in tuned.

 Here is the code snippet causing the problems:

 int reduce_scatter (..., mca_coll_base_module_t *module)
 {
 ...
 err = comm->c_coll.coll_reduce (..., module)
 ...
 }


 but should be
 {
 ...
 err = comm->c_coll.coll_reduce (...,
 comm->c_coll.coll_reduce_module);
 ...
 }

 The problem right now is that, when using hierarch, only a subset of
 the functions are set, e.g. reduce, allreduce, bcast and barrier.
 Thus, reduce_scatter comes from tuned in most scenarios, and calls the
 subsequent functions with the wrong module. Hierarch of course does
 not like that :-)

 Anyway, a quick glance through the tuned code reveals a significant
 number of instances where this appears (reduce_scatter, allreduce,
 allgather, allgatherv). Basic, hierarch and inter seem to do that
 mostly correctly.

 Thanks
 Edgar
 --
 Edgar Gabriel
 Assistant Professor
 Parallel Software Technologies Lab  http://pstl.cs.uh.edu

Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Edgar Gabriel
I'm already debugging it. The good news is that it only seems to appear
on the trunk; with 1.3 (after copying the new tuned module over), all
the tests pass.


Now if somebody can tell me a trick for telling mpirun not to kill the
debugger under my feet, then I could even see where the problem occurs :-)
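
(One commonly suggested approach -- it comes from the Open MPI FAQ, not
from this thread -- is to start every rank under its own debugger, so
the debugger lives inside the job rather than alongside it:

  mpirun -np 2 xterm -e gdb ./my_app

Here ./my_app is a placeholder for the failing test binary.)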


Thanks
Edgar

George Bosilca wrote:
All these errors are in MPI_Finalize; it should not be that hard to
find. I'll take a look later this afternoon.


  george.

On Jan 14, 2009, at 06:41 , Tim Mattox wrote:

Unfortunately, although this fixed some problems when enabling hierarch
coll, there is still a segfault in two of IU's tests that only shows up
when we set
-mca coll_hierarch_priority 100

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved
things, but there is still more to be done to get coll_hierarch ready
for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
 wrote:

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to
file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:


Unfortunately, this pinpoints the fact that we didn't test the
collective module mixing enough. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. Simultaneously,
I checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.

This is clearly a bug in tuned, and correcting it will allow people
to use hierarch. In its current incarnation, 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to put a broken
toy out there. How about pushing r20267 into 1.3?

george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:


Thanks for digging into this.  Can you file a bug?  Let's mark it for
v1.3.1.

I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
since hierarch isn't currently selected by default (you must
specifically elevate hierarch's priority to get it to run), there's no
danger that users will run into this problem in default runs.

But clearly the problem needs to be fixed, and therefore we need a bug
to track it.



On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that, when using hierarch, only a subset of
the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter comes from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)

Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread George Bosilca
All these errors are in MPI_Finalize; it should not be that hard to
find. I'll take a look later this afternoon.


  george.

On Jan 14, 2009, at 06:41 , Tim Mattox wrote:

Unfortunately, although this fixed some problems when enabling hierarch
coll, there is still a segfault in two of IU's tests that only shows up
when we set
-mca coll_hierarch_priority 100

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved
things, but there is still more to be done to get coll_hierarch ready
for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca  
 wrote:

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to
file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:

Unfortunately, this pinpoints the fact that we didn't test the
collective module mixing enough. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. Simultaneously,
I checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.

This is clearly a bug in tuned, and correcting it will allow people
to use hierarch. In its current incarnation, 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to put a broken
toy out there. How about pushing r20267 into 1.3?

george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark it for
v1.3.1.

I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
since hierarch isn't currently selected by default (you must
specifically elevate hierarch's priority to get it to run), there's no
danger that users will run into this problem in default runs.

But clearly the problem needs to be fixed, and therefore we need a bug
to track it.



On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that, when using hierarch, only a subset of
the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter comes from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)

Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Tim Mattox
Unfortunately, although this fixed some problems when enabling hierarch coll,
there is still a segfault in two of IU's tests that only shows up when we set
-mca coll_hierarch_priority 100
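
(For reference: hierarch is not selected by default, so exercising this
code path means raising its priority on the command line, along the
lines of

  mpirun -np 16 -mca coll_hierarch_priority 100 ./some_collective_test

where the process count and application name are made up for
illustration.)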

See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved things,
but there is still more to be done to get coll_hierarch ready for regular
use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca  wrote:
> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
>  george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>> Let's debate tomorrow when people are around, but first you have to file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>> Unfortunately, this pinpoints the fact that we didn't test the
>>> collective module mixing enough. I went over the tuned collective
>>> functions and changed all instances to use the correct module
>>> information. It is now on the trunk, revision 20267. Simultaneously,
>>> I checked that all other collective components do the right thing ...
>>> and I have to admit tuned was the only faulty one.
>>>
>>> This is clearly a bug in tuned, and correcting it will allow people
>>> to use hierarch. In its current incarnation, 1.3 will mostly/always
>>> segfault when hierarch is active. I would prefer not to put a broken
>>> toy out there. How about pushing r20267 into 1.3?
>>>
>>> george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
 Thanks for digging into this.  Can you file a bug?  Let's mark it for
 v1.3.1.

 I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
 since hierarch isn't currently selected by default (you must specifically
 elevate hierarch's priority to get it to run), there's no danger that users
 will run into this problem in default runs.

 But clearly the problem needs to be fixed, and therefore we need a bug
 to track it.



 On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

> I just debugged the Reduce_scatter bug mentioned previously. The bug
> is unfortunately not in hierarch, but in tuned.
>
> Here is the code snippet causing the problems:
>
> int reduce_scatter (..., mca_coll_base_module_t *module)
> {
> ...
> err = comm->c_coll.coll_reduce (..., module)
> ...
> }
>
>
> but should be
> {
> ...
> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
> ...
> }
>
> The problem right now is that, when using hierarch, only a subset of
> the functions are set, e.g. reduce, allreduce, bcast and barrier.
> Thus, reduce_scatter comes from tuned in most scenarios, and calls the
> subsequent functions with the wrong module. Hierarch of course does
> not like that :-)
>
> Anyway, a quick glance through the tuned code reveals a significant
> number of instances where this appears (reduce_scatter, allreduce,
> allgather, allgatherv). Basic, hierarch and inter seem to do that
> mostly correctly.
>
> Thanks
> Edgar
> --
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab  http://pstl.cs.uh.edu
> Department of Computer Science  University of Houston
> Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
> Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


 --
 Jeff Squyres
 Cisco Systems

 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread George Bosilca

Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

  george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to  
file a CMR... :-)


On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:

Unfortunately, this pinpoints the fact that we didn't test the
collective module mixing enough. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. Simultaneously,
I checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.


This is clearly a bug in tuned, and correcting it will allow people
to use hierarch. In its current incarnation, 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to put a broken
toy out there. How about pushing r20267 into 1.3?


george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark it  
for v1.3.1.


I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,  
and since hierarch isn't currently selected by default (you must  
specifically elevate hierarch's priority to get it to run),  
there's no danger that users will run into this problem in default  
runs.


But clearly the problem needs to be fixed, and therefore we need a  
bug to track it.




On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that, when using hierarch, only a subset of
the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter comes from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)


Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-13 Thread Tim Mattox
George, I suggest that you file a CMR for r20267 and we can
go from there.  If it makes 1.3, it makes it; otherwise we have
it ready for 1.3.1.  At this point the earliest 1.3 will go out is
late Wednesday morning (presuming I'm the one moving
the bits), and it is more likely to hit the website in the afternoon.
By morning I'll have odin's MTT results for r20267 on the trunk
using hierarch, and we can see if it fixed the problem.

On Tue, Jan 13, 2009 at 10:28 PM, George Bosilca  wrote:
> Unfortunately, this pinpoints the fact that we didn't test the
> collective module mixing enough. I went over the tuned collective
> functions and changed all instances to use the correct module
> information. It is now on the trunk, revision 20267. Simultaneously,
> I checked that all other collective components do the right thing ...
> and I have to admit tuned was the only faulty one.
>
> This is clearly a bug in tuned, and correcting it will allow people to
> use hierarch. In its current incarnation, 1.3 will mostly/always segfault
> when hierarch is active. I would prefer not to put a broken toy out there.
> How about pushing r20267 into 1.3?
>
>  george.
>
>
> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>
>> Thanks for digging into this.  Can you file a bug?  Let's mark it for
>> v1.3.1.
>>
>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>> since hierarch isn't currently selected by default (you must specifically
>> elevate hierarch's priority to get it to run), there's no danger that users
>> will run into this problem in default runs.
>>
>> But clearly the problem needs to be fixed, and therefore we need a bug to
>> track it.
>>
>>
>>
>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>
>>> I just debugged the Reduce_scatter bug mentioned previously. The bug
>>> is unfortunately not in hierarch, but in tuned.
>>>
>>> Here is the code snippet causing the problems:
>>>
>>> int reduce_scatter (..., mca_coll_base_module_t *module)
>>> {
>>> ...
>>> err = comm->c_coll.coll_reduce (..., module)
>>> ...
>>> }
>>>
>>>
>>> but should be
>>> {
>>> ...
>>> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>> ...
>>> }
>>>
>>> The problem right now is that, when using hierarch, only a subset of
>>> the functions are set, e.g. reduce, allreduce, bcast and barrier.
>>> Thus, reduce_scatter comes from tuned in most scenarios, and calls the
>>> subsequent functions with the wrong module. Hierarch of course does
>>> not like that :-)
>>>
>>> Anyway, a quick glance through the tuned code reveals a significant
>>> number of instances where this appears (reduce_scatter, allreduce,
>>> allgather, allgatherv). Basic, hierarch and inter seem to do that
>>> mostly correctly.
>>>
>>> Thanks
>>> Edgar
>>> --
>>> Edgar Gabriel
>>> Assistant Professor
>>> Parallel Software Technologies Lab  http://pstl.cs.uh.edu
>>> Department of Computer Science  University of Houston
>>> Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
>>> Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-13 Thread Jeff Squyres
Let's debate tomorrow when people are around, but first you have to  
file a CMR... :-)


On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:

Unfortunately, this pinpoints the fact that we didn't test the
collective module mixing enough. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. Simultaneously,
I checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.


This is clearly a bug in tuned, and correcting it will allow people
to use hierarch. In its current incarnation, 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to put a broken
toy out there. How about pushing r20267 into 1.3?


 george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark it  
for v1.3.1.


I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,  
and since hierarch isn't currently selected by default (you must  
specifically elevate hierarch's priority to get it to run), there's  
no danger that users will run into this problem in default runs.


But clearly the problem needs to be fixed, and therefore we need a  
bug to track it.




On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that, when using hierarch, only a subset of
the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter comes from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)


Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-13 Thread George Bosilca
Unfortunately, this pinpoints the fact that we didn't test the
collective module mixing enough. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. Simultaneously,
I checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.


This is clearly a bug in tuned, and correcting it will allow people
to use hierarch. In its current incarnation, 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to put a broken
toy out there. How about pushing r20267 into 1.3?


  george.


On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this.  Can you file a bug?  Let's mark it  
for v1.3.1.


I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,  
and since hierarch isn't currently selected by default (you must  
specifically elevate hierarch's priority to get it to run), there's  
no danger that users will run into this problem in default runs.


But clearly the problem needs to be fixed, and therefore we need a  
bug to track it.




On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that, when using hierarch, only a subset of
the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter comes from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)


Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-13 Thread Jeff Squyres
Thanks for digging into this.  Can you file a bug?  Let's mark it for  
v1.3.1.


I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and  
since hierarch isn't currently selected by default (you must  
specifically elevate hierarch's priority to get it to run), there's no  
danger that users will run into this problem in default runs.


But clearly the problem needs to be fixed, and therefore we need a bug  
to track it.




On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The
bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}


but should be
{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem right now is that, when using hierarch, only a subset of
the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter comes from tuned in most scenarios, and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)
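
(Below is a minimal sketch of the ownership rule Edgar describes,
written against the 1.3-era coll interface; the function name, argument
list, and header paths are illustrative assumptions, not the actual
coll_tuned source.)

#include "ompi/communicator/communicator.h"
#include "ompi/mca/coll/coll.h"

/* Sketch: any collective that delegates through comm->c_coll must pass
 * the module that owns the target function pointer, not the module it
 * was itself invoked with. */
static int sketch_delegating_reduce(void *sbuf, void *rbuf, int count,
                                    struct ompi_datatype_t *dtype,
                                    struct ompi_op_t *op, int root,
                                    struct ompi_communicator_t *comm,
                                    mca_coll_base_module_t *module)
{
    /* WRONG: forwarding our own module; if hierarch owns coll_reduce,
     * it receives tuned's module and dereferences the wrong state:
     *
     *   return comm->c_coll.coll_reduce(sbuf, rbuf, count, dtype, op,
     *                                   root, comm, module);
     */

    (void)module;  /* deliberately unused: the caller's module is not the owner's */

    /* RIGHT: each c_coll function pointer is paired with a
     * coll_<name>_module field naming the component that provides it: */
    return comm->c_coll.coll_reduce(sbuf, rbuf, count, dtype, op,
                                    root, comm,
                                    comm->c_coll.coll_reduce_module);
}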


Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.


Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems