I just checked in a fix (I hope). I think the problem was that the errmgr was removing children from the list of odls children without holding the mutex that protects that list, which left a race condition. Let me know if MTT is still having problems tomorrow.
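For anyone following along, here is a rough sketch of the general pattern, not the actual change. It assumes a plain pthread mutex and a simple singly linked list. The names child_t, child_list_t, and the child_list_* functions are made up for illustration and are not the real errmgr/odls types or calls. The point is only that every add, remove, and traversal takes the same lock.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tracked child process; not the real odls type. */
typedef struct child {
    int rank;
    struct child *next;
} child_t;

/* Hypothetical shared list together with the mutex that guards it. */
typedef struct {
    child_t *head;
    pthread_mutex_t lock;
} child_list_t;

static void child_list_init(child_list_t *list)
{
    list->head = NULL;
    pthread_mutex_init(&list->lock, NULL);
}

/* Prepend a child. Returns 0 on success, -1 if allocation fails. */
static int child_list_add(child_list_t *list, int rank)
{
    child_t *c = malloc(sizeof(*c));
    if (c == NULL) {
        return -1;
    }
    c->rank = rank;

    pthread_mutex_lock(&list->lock);
    c->next = list->head;
    list->head = c;
    pthread_mutex_unlock(&list->lock);
    return 0;
}

/* Remove the first child with the given rank while holding the lock.
 * Returns 1 if a child was removed, 0 if none matched. */
static int child_list_remove(child_list_t *list, int rank)
{
    int removed = 0;

    pthread_mutex_lock(&list->lock);
    for (child_t **pp = &list->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->rank == rank) {
            child_t *dead = *pp;
            *pp = dead->next;   /* unlink before freeing */
            free(dead);
            removed = 1;
            break;
        }
    }
    pthread_mutex_unlock(&list->lock);
    return removed;
}

/* Count the children; traversal takes the same lock as removal. */
static int child_list_count(child_list_t *list)
{
    int n = 0;

    pthread_mutex_lock(&list->lock);
    for (child_t *c = list->head; c != NULL; c = c->next) {
        n++;
    }
    pthread_mutex_unlock(&list->lock);
    return n;
}
```

Without the lock in child_list_remove, a concurrent traversal could follow a pointer to a child that has just been freed, which is the kind of intermittent failure I suspect here.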
Wes

> I am seeing the intel test suite tests MPI_Errhandler_fatal_c and
> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I have not
> seen this test failing under MTT until the epoch code was added. So I
> have a suspicion the epoch code might be at fault. Could someone
> familiar with the epoch changes (Wesley) take a look at this failure?
>
> Note this intermittently fails but fails for me more times than not.
> Attached is a log file of a run that succeeds followed by the failing
> run. The pieces of concern are the messages involving
> mca_oob_tcp_msg_recv and below.
>
> thanks,
>
> --
> Oracle
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle *- Performance Technologies*
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com