We fixed the openib segv, but I forgot to follow up on the timeouts that I mentioned in my original mail.

The timeouts came from poorly configured spawn tests. That is, I had 8 cores in the job and ran the spawn test on all 8 cores (all aggressively polling). The spawn test then spawned N more MPI processes, each of which also [attempts to] poll heavily. With more busy-polling processes than cores, this causes obvious thrashage and the test doesn't complete before the timeout.
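
To put rough numbers on that, here is a minimal sketch of the arithmetic. The function name and the values chosen for N and for the reduced run are hypothetical, picked only for illustration; the mail gives just the 8 cores and 8 original procs.

```python
def polling_procs_per_core(cores: int, parents: int, spawned: int) -> float:
    """Return how many busy-polling MPI processes compete for each core
    once the spawn has completed (parents plus spawned children)."""
    return (parents + spawned) / cores


CORES = 8            # from the mail: 8 cores in the job
ORIGINAL_PROCS = 8   # from the mail: spawn test ran on all 8 cores

# Hypothetical: the mail only says "N more"; 8 is an assumed value.
N_SPAWNED = 8

# Badly configured run: every core already has a polling parent.
full = polling_procs_per_core(CORES, ORIGINAL_PROCS, N_SPAWNED)

# Hypothetical reduced run: fewer than 8 parents leaves room for children.
reduced = polling_procs_per_core(CORES, parents=4, spawned=4)

print(f"all cores used by parents: {full:.1f} polling procs per core")
print(f"reduced parent count:      {reduced:.1f} polling procs per core")
```

Anything above 1.0 here means polling processes are fighting over cores, which is the situation the timed-out runs were in.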

These were obviously poorly configured tests on my part and not a real problem (I confirmed by re-running the tests with <8 original MPI procs). So as I mentioned in my prior mail, thumbs up for the v1.3 release from my perspective.



On Jan 15, 2009, at 9:05 AM, Jeff Squyres wrote:

Unfortunately, I have to throw the flag in the v1.3 release.  :-(

I ran ~16k tests via MTT yesterday on the rc5 and rc6 tarballs. I found the following:

Found test runs: 15962
Passed: 15785 (98.89%)
Failed: 83 (0.52%)
--> Openib failures: 80 (0.50%)
Skipped: 46 (0.29%)
Timedout: 48 (0.30%)

The 80 openib failures are all seemingly random segv's. I repeated a much smaller run this morning (about 700 runs) and still found a non-zero percentage of fails of the same flavor.

The timeouts are a little worrisome as well.

This unfortunately requires investigation.  :-(

--
Jeff Squyres
Cisco Systems

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
Cisco Systems
