Re: [OMPI devel] trunk hangs since r19010
On Jul 29, 2008, at 9:47 AM, Jeff Squyres wrote:

> Ok. FWIW, Pasha and I think that openib has supported "send-to-self" for a while (we don't know exactly when; but Pasha thinks it is very old code that we don't check for self in add_procs). But it only broke recently.

More in the FWIW category -- we just checked, and OMPI v1.2 supported "--mca btl openib" (note the lack of ",self"). So the openib BTL has, indeed, supported send-to-self for quite a while.

This should help narrow where to start looking for the problem: changes within the last few weeks.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Ok. FWIW, Pasha and I think that openib has supported "send-to-self" for a while (we don't know exactly when; but Pasha thinks it is very old code that we don't check for self in add_procs). But it only broke recently.

On Jul 29, 2008, at 9:31 AM, George Bosilca wrote:

> I ran a few tests and the only combination leading to a deadlock is openib and self. As openib is the only BTL supporting self communications (except self of course), I guess it interferes with self in some more or less strange ways. I didn't have the time to dig deeper yet to see what exactly happens there; I'll schedule this later today.
>
> george.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Jeff Squyres wrote:

> This used to be true, but I think we changed it a while ago (Pasha: do you remember?) because Mellanox HCAs are capable of send-to-self (process) and there were no code changes necessary to enable it. So it allowed a slightly simpler command line. This was quite a while ago, IIRC.

Yep, correct.

FYI: in my MTT testing I also see a lot of killed tests.
Re: [OMPI devel] trunk hangs since r19010
On Mon, Jul 28, 2008 at 12:08 PM, Terry Dontje wrote:

> Jeff Squyres wrote:
>
>> But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.
>
> Is it something to do with the MPI_Barrier call? osu_latency uses MPI_Barrier and from rhc's email it sounds like his code does too.
>
> --td

I don't think it's an issue with MPI_Barrier(). I'm running into this problem with srtest.c (one of the example programs from the mpich distribution). It's a ring-type test with no barriers until the end, yet it hangs on the very first Send/Recv pair from rank0 to rank1. In my case, openib and openib+sm work, but openib+self and openib+sm+self hang.

--brad
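To see which BTL is common to the hangs, the per-combination results reported in this thread by Ralph, Jeff, and Brad can be tabulated. The following is only a rough Python sketch: the tuples are transcribed by hand from the messages, Ralph's test program is not named in his report (hence the placeholder label), and the function names are made up for illustration.

```python
# (reporter, test program, BTL list, outcome), transcribed from this thread.
# "(not named)" is a placeholder: Ralph's message does not say which test he ran.
REPORTS = [
    ("Ralph", "(not named)", "openib", "works"),
    ("Ralph", "(not named)", "openib,self", "hangs"),
    ("Ralph", "(not named)", "openib,sm", "works"),
    ("Jeff", "ring", "openib", "works"),
    ("Jeff", "ring", "openib,self", "works"),
    ("Jeff", "ring", "openib,self,sm", "works"),
    ("Jeff", "osu_latency", "openib", "works"),
    ("Jeff", "osu_latency", "openib,sm", "works"),
    ("Jeff", "osu_latency", "openib,sm,self", "hangs"),
    ("Brad", "srtest", "openib", "works"),
    ("Brad", "srtest", "openib,sm", "works"),
    ("Brad", "srtest", "openib,self", "hangs"),
    ("Brad", "srtest", "openib,sm,self", "hangs"),
]


def btls_in_every_hang(reports):
    """Return the set of BTLs present in every run that was reported to hang."""
    hangs = [set(btl.split(",")) for _, _, btl, outcome in reports
             if outcome == "hangs"]
    return set.intersection(*hangs) if hangs else set()


def works_with(reports, component):
    """Return the (reporter, program, BTL list) runs that worked with `component`."""
    return [(who, prog, btl) for who, prog, btl, outcome in reports
            if outcome == "works" and component in btl.split(",")]
```

On these reports, `btls_in_every_hang(REPORTS)` leaves only openib and self, so sm is not needed to trigger the hang, while `works_with(REPORTS, "self")` returns just the two ring runs. That is consistent with self being involved without being sufficient on its own, but it is a reading of a handful of reports, not a diagnosis.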
Re: [OMPI devel] trunk hangs since r19010
Jeff Squyres wrote:

> FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.
>
> But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.

Is it something to do with the MPI_Barrier call? osu_latency uses MPI_Barrier and from rhc's email it sounds like his code does too.

--td
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:

> Interesting. The self is only used for local communications. I don't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test, I will take a look this evening.

FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.

But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Interesting. The self is only used for local communications. I don't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test, I will take a look this evening.

Thanks,
george.

On Jul 28, 2008, at 5:56 PM, Ralph Castain wrote:

> I just re-tested to confirm, and that is correct.
>
> -mca btl openib       works
> -mca btl openib,self  hangs
> -mca btl openib,sm    works
Re: [OMPI devel] trunk hangs since r19010
I just re-tested to confirm, and that is correct.

-mca btl openib       works
-mca btl openib,self  hangs
-mca btl openib,sm    works

On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:

> I'm a little bit lost here. You're stating that openib,self doesn't work while openib does? In other words, that adding self to the BTL leads to deadlocks?
>
> george.
>
> PS: Btw, it is not supposed to work at all, except in the case where openib handles internal messages (where the source and destination are the same process).
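For anyone who wants to re-run this three-way comparison, here is a minimal Python sketch that builds one mpirun command per BTL list and treats a timeout as a hang. The program path `./osu_latency`, the process count of 2, and the 60-second timeout are assumptions chosen for illustration, not values taken from this thread; the only mpirun options used are `-np` and `--mca btl`.

```python
import subprocess

# BTL lists compared in the message above.
BTL_LISTS = ["openib", "openib,self", "openib,sm"]


def build_command(btl_list, program="./osu_latency", nprocs=2):
    """Return the mpirun argv for one BTL list.

    `program` and `nprocs` are placeholder assumptions, not from the thread.
    """
    return ["mpirun", "-np", str(nprocs), "--mca", "btl", btl_list, program]


def classify(btl_list, timeout_s=60, runner=subprocess.run):
    """Run one combination and return 'works', 'fails', or 'hangs'.

    A run that exceeds `timeout_s` seconds is counted as a hang; the timeout
    value is an arbitrary assumption. `runner` can be swapped out for testing.
    """
    try:
        result = runner(build_command(btl_list), timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hangs"
    return "works" if result.returncode == 0 else "fails"


def report(btl_lists=BTL_LISTS):
    """Print one line per BTL list in the same shape as the table above."""
    for btl_list in btl_lists:
        print(f"-mca btl {btl_list:<12} {classify(btl_list)}")
```

Calling `report()` on a machine with Open MPI installed would run the three cases in turn. Note that `subprocess.run` only kills the mpirun process itself on timeout, so hung MPI ranks may need to be cleaned up separately.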
Re: [OMPI devel] trunk hangs since r19010
I'm a little bit lost here. You're stating that openib,self doesn't work while openib does? In other words, that adding self to the BTL leads to deadlocks?

george.

PS: Btw, it is not supposed to work at all, except in the case where openib handles internal messages (where the source and destination are the same process).

On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:

> On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:
>
>> only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.
>
> Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:

> only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.

Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?
Re: [OMPI devel] trunk hangs since r19010
only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.

On 7/28/08, Jeff Squyres wrote:

> FWIW, all my MTT runs are hanging as well.
Re: [OMPI devel] trunk hangs since r19010
FWIW, all my MTT runs are hanging as well.

On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
>
> --brad

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Interesting - you are quite correct and I should have been more precise. I ran with -mca btl openib and it worked. So having just openib seems to be okay.

On Jul 28, 2008, at 8:37 AM, Brad Benton wrote:

> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
>
> --brad
Re: [OMPI devel] trunk hangs since r19010
My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.

--brad

2008/7/28 Lenny Verkhovsky:

> I failed to run on different nodes or on the same node via self,openib
Re: [OMPI devel] trunk hangs since r19010
I failed to run on different nodes or on the same node via self,openib

On 7/28/08, Ralph Castain wrote:

> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
Re: [OMPI devel] trunk hangs since r19010
I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.

On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>
> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>
>> I believe it is.
Re: [OMPI devel] trunk hangs since r19010
I believe it is.

On 7/28/08, Jeff Squyres wrote:

> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>
>> Is this related to r1378?
>
> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

> Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.

> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>
>> Hi,
>>
>> I experience hanging of tests ( latency ) since r19010
>>
>> Best Regards
>>
>> Lenny.

--
Jeff Squyres
Cisco Systems