We are running RHEL4 with a 2.6.9-42EL kernel on a Rocks install. The tests I run are IMB with Intel MPI over uDAPL, running at the same time as IMB over IPoIB. It usually takes at least 1 day, sometimes 2 days, of running IMB in a loop with various numbers of processes per node (1, 2, and 4) before it fails. It seems to fail randomly, not on the same node every time, so it is not feasible to connect a serial console to every node. It would also be hard for us to put in a new kernel, as this causes problems with Rocks. The systems are the older Xeon, Lindenhurst, 3.6 GHz.

I have not seen this error on any other kernel or system. I have tested RHEL5 and RHEL4-U5, but only on 2 nodes, and those do not seem to fail. We also have OFED 1.2 running on 64- and 256-node production applications development clusters, and they have not reported any similar problems, but they are not running the same tests.

I plan on loading OFED 1.2-rc5 today. Is there an easy way to build the IPoIB driver from the OFED installer so that it has debug enabled?

woody

-----Original Message-----
From: Michael S. Tsirkin [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 13, 2007 11:10 AM
To: Hefty, Sean
Cc: 'Michael S. Tsirkin'; Sean Hefty; Woodruff, Robert J; 'Vladimir Sokolovsky'; [email protected]
Subject: Re: Re: crash in ipoib

> Quoting Sean Hefty <[EMAIL PROTECTED]>:
> Subject: RE: Re: crash in ipoib
>
> >This looks strange. Can you supply some more data please?
> >Which HCA are you running on?
> >What test are you running?
> >What should I do to reproduce this?
> >Further, could you supply the full oops?
>
> Woody will need to answer the test/config questions. The oops is only displayed
> on the screen, and the stack trace is about 50-75 calls long. The start of the
> oops gets pushed off the screen. (Can we be overrunning the stack?) I'm not at
> the systems today, but can probably get what else is available tomorrow.

Getting a serial console would be the thing to do then.
If you are worried about stack overflow, build your kernel
with stack instrumentation.
It's quite likely the real oops reason has scrolled off the screen;
what you post here could be the result of following memory corruption.

> We have, I think, up to 16 systems running the tests, and we only see failures
> on specific nodes (which all happen to be the same type of system).

One thing to try to check is whether it's kernel-specific.
What happens if you install a different kernel/OS there?
Try RHEL5 or just build a 2.6.20 kernel there. Does it still happen?

--
MST
