Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Jeff Squyres

On Jul 29, 2008, at 9:47 AM, Jeff Squyres wrote:

Ok.  FWIW, Pasha and I think that openib has supported "send-to-self"
for a while (we don't know exactly when; but Pasha thinks it is very
old code that we don't check for self in add_procs).  But it only
broke recently.



More in the FWIW category -- we just checked, and OMPI v1.2 supported  
"--mca btl openib" (note the lack of ",self").  So the openib BTL has,  
indeed, supported send-to-self for quite a while.


This should help narrow where to start looking for the problem:  
changes within the last few weeks.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Jeff Squyres
Ok.  FWIW, Pasha and I think that openib has supported "send-to-self"  
for a while (we don't know exactly when; but Pasha thinks it is very  
old code that we don't check for self in add_procs).  But it only  
broke recently.



On Jul 29, 2008, at 9:31 AM, George Bosilca wrote:

I ran a few tests and the only combination leading to a deadlock is
openib and self. As openib is the only BTL supporting self
communications (except self of course), I guess it interferes with
self in some more or less strange ways. I haven't had the time to dig
deeper yet to see what exactly happens there; I'll schedule this for
later today.


 george.

On Jul 29, 2008, at 8:52 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago  
(Pasha: do you remember?) because Mellanox HCAs are capable of  
send-to-self (process) and there were no code changes necessary to  
enable it.  So it allowed a slightly simpler command line.  This  
was quite a while ago, IIRC.

Yep, Correct.

FYI. In my MTT testing I also see a lot of killed tests.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago (Pasha: do 
you remember?) because Mellanox HCAs are capable of send-to-self 
(process) and there were no code changes necessary to enable it.  So 
it allowed a slightly simpler command line.  This was quite a while 
ago, IIRC.

Yep, Correct.

FYI. In my MTT testing I also see a lot of killed tests.


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
On Mon, Jul 28, 2008 at 12:08 PM, Terry Dontje  wrote:

> Jeff Squyres wrote:
>
>> On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
>>
>>> Interesting. The self BTL is only used for local communications. I don't
>>> expect that any benchmark executes such communications, but apparently I was
>>> wrong. Please let me know the failing test; I will take a look this evening.
>>>
>>
>> FWIW, my manual tests of a simplistic "ring" program work for all
>> combinations (openib, openib+self, openib+self+sm).  Shrug.
>>
>> But for OSU latency, I found that openib, openib+sm work, but
>> openib+sm+self hangs (same results whether the 2 procs are on the same node
>> or different nodes).  There is no self communication in osu_latency, so
>> something else must be going on.
>>
> Is it something to do with the MPI_Barrier call?  osu_latency uses
> MPI_Barrier and from rhc's email it sounds like his code does too.


I don't think it's an issue with MPI_Barrier().  I'm running into this
problem with srtest.c (one of the example programs from the mpich
distribution).  It's a ring-type test with no barriers until the end, yet it
hangs on the very first Send/Recv pair from rank0 to rank1.

In my case, openib and openib+sm work, but openib+self & openib+sm+self
hang.

--brad


>
> --td


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Terry Dontje

Jeff Squyres wrote:

On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:

Interesting. The self BTL is only used for local communications. I don't
expect that any benchmark executes such communications, but apparently
I was wrong. Please let me know the failing test; I will take a look
this evening.


FWIW, my manual tests of a simplistic "ring" program work for all 
combinations (openib, openib+self, openib+self+sm).  Shrug.


But for OSU latency, I found that openib, openib+sm work, but 
openib+sm+self hangs (same results whether the 2 procs are on the same 
node or different nodes).  There is no self communication in 
osu_latency, so something else must be going on.


Is it something to do with the MPI_Barrier call?  osu_latency uses 
MPI_Barrier and from rhc's email it sounds like his code does too.


--td


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:

Interesting. The self BTL is only used for local communications. I don't
expect that any benchmark executes such communications, but
apparently I was wrong. Please let me know the failing test; I will
take a look this evening.


FWIW, my manual tests of a simplistic "ring" program work for all  
combinations (openib, openib+self, openib+self+sm).  Shrug.


But for OSU latency, I found that openib, openib+sm work, but
openib+sm+self hangs (same results whether the 2 procs are on the same
node or different nodes).  There is no self communication in
osu_latency, so something else must be going on.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread George Bosilca
Interesting. The self BTL is only used for local communications. I don't
expect that any benchmark executes such communications, but apparently
I was wrong. Please let me know the failing test; I will take a look
this evening.


  Thanks,
george.

On Jul 28, 2008, at 5:56 PM, Ralph Castain wrote:


I just re-tested to confirm, and that is correct.

-mca btl openib        works
-mca btl openib,self   hangs
-mca btl openib,sm     works


On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:

I'm a little bit lost here. You're stating that openib,self doesn't  
work while openib does? In other words that adding self to the BTL  
leads to deadlocks?


george.

PS: Btw, it is not supposed to work at all, except in the case
where openib handles internal messages (where the source and
destination are the same process).


On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:



On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to
use self btl.




Don't know - could be true. But if that is true, then we should  
check to see if that condition is met and error out - with an  
appropriate message - if so. Otherwise, how is a user supposed to  
know this condition?





On 7/28/08, Jeff Squyres wrote:

FWIW, all my MTT runs are hanging as well.




On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via  
self,openib





On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what  
I had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB  
was locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it  
further - planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems





Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain

I just re-tested to confirm, and that is correct.

-mca btl openib        works
-mca btl openib,self   hangs
-mca btl openib,sm     works
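
For anyone wanting to repeat this comparison, here is a minimal sketch of how the three selections above could be re-run with a timeout so that a hang is reported instead of blocking forever. It is only an illustration: the test program path (./osu_latency), the process count, and the 60-second limit are assumptions, not values taken from this thread.

```python
import subprocess

# The three BTL selections reported above.
BTL_COMBOS = ["openib", "openib,self", "openib,sm"]


def build_command(btl, program="./osu_latency", nprocs=2):
    """Return the mpirun argument list for one BTL selection.

    The program path and process count are assumptions for illustration.
    """
    return ["mpirun", "-np", str(nprocs), "--mca", "btl", btl, program]


def classify_run(cmd, timeout_s=60):
    """Run cmd and return 'works', 'fails', or 'hangs'.

    A run that has not exited after timeout_s seconds is treated as a hang.
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "hangs"
    return "works" if result.returncode == 0 else "fails"


def report(combos=BTL_COMBOS, timeout_s=60):
    """Return one result line per BTL selection (needs mpirun on PATH)."""
    return [
        f"-mca btl {btl:<12} {classify_run(build_command(btl), timeout_s)}"
        for btl in combos
    ]
```

Calling report() on a machine with mpirun and the test binary available would produce one line per selection in the same shape as the list above. Note that a timed-out mpirun may leave MPI processes behind, so some cleanup between runs is probably needed.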


On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:

I'm a little bit lost here. You're stating that openib,self doesn't  
work while openib does? In other words that adding self to the BTL  
leads to deadlocks?


 george.

PS: Btw, it is not supposed to work at all, except in the case where
openib handles internal messages (where the source and destination
are the same process).


On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:



On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to
use self btl.




Don't know - could be true. But if that is true, then we should  
check to see if that condition is met and error out - with an  
appropriate message - if so. Otherwise, how is a user supposed to  
know this condition?





On 7/28/08, Jeff Squyres wrote:

FWIW, all my MTT runs are hanging as well.




On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via  
self,openib





On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what  
I had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB  
was locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it  
further - planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems







Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread George Bosilca
I'm a little bit lost here. You're stating that openib,self doesn't  
work while openib does? In other words that adding self to the BTL  
leads to deadlocks?


  george.

PS: Btw, it is not supposed to work at all, except in the case where
openib handles internal messages (where the source and destination
are the same process).


On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:



On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to
use self btl.




Don't know - could be true. But if that is true, then we should  
check to see if that condition is met and error out - with an  
appropriate message - if so. Otherwise, how is a user supposed to  
know this condition?





On 7/28/08, Jeff Squyres wrote:

FWIW, all my MTT runs are hanging as well.




On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via  
self,openib





On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it further  
- planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems





Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain


On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to use
self btl.




Don't know - could be true. But if that is true, then we should check  
to see if that condition is met and error out - with an appropriate  
message - if so. Otherwise, how is a user supposed to know this  
condition?
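
To make the suggestion concrete, here is a rough sketch of such a check, written as a standalone Python helper rather than as Open MPI code. The function names are hypothetical, and the rule it enforces (that "self" must always be selected) is exactly the claim that is still in question in this thread. It only relies on the usual MCA list syntax: a comma-separated list of components, with a leading "^" turning the list into an exclusion list (as in ^sm).

```python
def btl_selection_includes(spec, component="self"):
    """Return True if an MCA btl value such as 'openib,sm' selects component.

    A leading '^' makes the value an exclusion list ('^sm' means every
    available BTL except sm), so the component counts as selected unless it
    is named there. This assumes the component is available on the system.
    """
    spec = spec.strip()
    if spec.startswith("^"):
        excluded = [name.strip() for name in spec[1:].split(",")]
        return component not in excluded
    included = [name.strip() for name in spec.split(",")]
    return component in included


def check_btl_selection(spec):
    """Hypothetical check: return an error message, or None if acceptable."""
    if not btl_selection_includes(spec, "self"):
        return (
            f"btl value '{spec}' does not select the 'self' BTL; "
            "send-to-self communication may not be supported"
        )
    return None
```

With this sketch, check_btl_selection("openib") would return a message while check_btl_selection("openib,self") and check_btl_selection("^sm") would return None. Whether the condition should be an error at all depends on whether openib is really meant to handle send-to-self on its own.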





On 7/28/08, Jeff Squyres  wrote:
FWIW, all my MTT runs are hanging as well.



On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via self,openib




On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications. Hadn't  
seen that before, and at least I haven't explored it further -  
planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems







Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
only openib works for me too,

but Glebs said to me once that it's illegal and I always need to use self
btl.

On 7/28/08, Jeff Squyres  wrote:
>
> FWIW, all my MTT runs are hanging as well.
>
>
> On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
>
>> My experience is the same as Lenny's.  I've tested on x86_64 and ppc64
>> systems and tests using --mca btl openib,self hang in all cases.
>>
>> --brad
>>
>>
>> 2008/7/28 Lenny Verkhovsky 
>> I failed to run on different nodes or on the same node via self,openib
>>
>>
>>
>>
>> On 7/28/08, Ralph Castain  wrote:
>> I checked this out some more and I believe it is ticket #1378 related. We
>> lock up if SM is included in the BTL's, which is what I had done on my test.
>> If I ^sm, I can run fine.
>>
>>
>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>
>>> It could also be something new. Brad and I noted on Fri that IB was
>>> locking up as soon as we tried any cross-node communications. Hadn't seen
>>> that before, and at least I haven't explored it further - planned to do so
>>> today.
>>>
>>>
>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>
>>>  I believe it is.

 On 7/28/08, Jeff Squyres wrote:
 On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

 Is this related to r1378?

 Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



 On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

 Hi,

 I experience hanging of tests ( latency ) since r19010


 Best Regards

 Lenny.



 --
 Jeff Squyres
 Cisco Systems



>
>
> --
> Jeff Squyres
> Cisco Systems
>


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

FWIW, all my MTT runs are hanging as well.


On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via self,openib




On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it further  
- planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:


I believe it is.

On 7/28/08, Jeff Squyres wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems






--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
Interesting - you are quite correct and I should have been more  
precise. I ran with -mca btl openib and it worked. So having just  
openib seems to be okay.




On Jul 28, 2008, at 8:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via self,openib




On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it further  
- planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:


I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems







Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
My experience is the same as Lenny's.  I've tested on x86_64 and ppc64
systems and tests using --mca btl openib,self hang in all cases.

--brad


2008/7/28 Lenny Verkhovsky 

> I failed to run on different nodes or on the same node via self,openib
>
>
>
> On 7/28/08, Ralph Castain  wrote:
>>
>> I checked this out some more and I believe it is ticket #1378 related. We
>> lock up if SM is included in the BTL's, which is what I had done on my test.
>> If I ^sm, I can run fine.
>>
>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>
>> It could also be something new. Brad and I noted on Fri that IB was
>> locking up as soon as we tried any cross-node communications. Hadn't seen
>> that before, and at least I haven't explored it further - planned to do so
>> today.
>>
>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>
>> I believe it is.
>>
>> On 7/28/08, Jeff Squyres  wrote:
>>>
>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>
>>>  Is this related to r1378?

>>>
>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>
>>>
>>>  On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

  Hi,
>
> I experience hanging of tests ( latency ) since r19010
>
>
> Best Regards
>
> Lenny.
>


 --
 Jeff Squyres
 Cisco Systems




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
I failed to run on different nodes or on the same node via self,openib



On 7/28/08, Ralph Castain  wrote:
>
> I checked this out some more and I believe it is ticket #1378 related. We
> lock up if SM is included in the BTL's, which is what I had done on my test.
> If I ^sm, I can run fine.
>
> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>
> It could also be something new. Brad and I noted on Fri that IB was locking
> up as soon as we tried any cross-node communications. Hadn't seen that
> before, and at least I haven't explored it further - planned to do so today.
>
> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>
> I believe it is.
>
> On 7/28/08, Jeff Squyres  wrote:
>>
>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>
>>  Is this related to r1378?
>>>
>>
>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>
>>
>>  On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>
>>>  Hi,

 I experience hanging of tests ( latency ) since r19010


 Best Regards

 Lenny.


>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>>


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
I checked this out some more and I believe it is ticket #1378 related.  
We lock up if SM is included in the BTL's, which is what I had done on  
my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications. Hadn't  
seen that before, and at least I haven't explored it further -  
planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:


I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems







Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
I believe it is.

On 7/28/08, Jeff Squyres  wrote:
>
> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>
>  Is this related to r1378?
>>
>
> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>
>
>  On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>
>>  Hi,
>>>
>>> I experience hanging of tests ( latency ) since r19010
>>>
>>>
>>> Best Regards
>>>
>>> Lenny.
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>
> --
> Jeff Squyres
> Cisco Systems
>


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?


Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:


Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.




--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems