Terry,

The test succeeded in both of your runs.
However, I rolled back to before the epoch change (r24814) and the output is the following:

MPITEST info (0): Starting MPI_Errhandler_fatal test
MPITEST info (0): This test will abort after printing the results message
MPITEST info (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dancer.eecs.utk.edu:16098] ***    and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

As you can see, it is identical to the output in your test.

  george.

On Aug 18, 2011, at 12:29, TERRY DONTJE wrote:

> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails. Everything
> is the same except I don't see the "readv failed..." message.
>
> Have you tried to run this code yourself? It is pretty simple and fails
> on one node using np=4.
>
> --td
>
> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>> I just checked in a fix (I hope). I think the problem was that the errmgr
>> was removing children from the list of odls children without using the
>> mutex to prevent race conditions. Let me know if MTT is still having
>> problems tomorrow.
>>
>> Wes
>>
>>> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I had not
>>> seen this test failing under MTT until the epoch code was added, so I
>>> suspect the epoch code might be at fault. Could someone familiar with
>>> the epoch changes (Wesley) take a look at this failure?
>>>
>>> Note that this fails intermittently, but it fails for me more often
>>> than not. Attached is a log file of a run that succeeds, followed by
>>> the failing run. The messages of concern are the ones involving
>>> mca_oob_tcp_msg_recv and below.
>>>
>>> thanks,
>>>
>>> --
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle - Performance Technologies
>>> 95 Network Drive, Burlington, MA 01803
>>> Email terry.don...@oracle.com
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
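
For anyone who wants to reproduce this without the Intel test suite, below is a minimal sketch of a test with the same shape (this is not the actual MPI_Errhandler_fatal_c source): print the results message first, then trigger the default MPI_ERRORS_ARE_FATAL handler by sending to an out-of-range rank on a dup'ed communicator, which matches the "DUP FROM 0" and "invalid rank" lines in the logs above. Run it with something like "mpirun -np 4 ./errhandler_fatal".

/*
 * Minimal sketch (not the Intel suite source) of a test shaped like
 * MPI_Errhandler_fatal_c: print the PASSED summary first, then trigger
 * the default MPI_ERRORS_ARE_FATAL handler by sending to an invalid
 * rank on a dup'ed communicator.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 42;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);  /* the log shows "DUP FROM 0" */
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        printf("MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)\n");
        fflush(stdout);
    }
    MPI_Barrier(comm);  /* make sure the summary is out before we abort */

    /* Rank "size" does not exist, so every process hits MPI_ERR_RANK and
     * the default MPI_ERRORS_ARE_FATAL handler aborts the job here. */
    MPI_Send(&buf, 1, MPI_INT, size, 0, comm);

    MPI_Finalize();     /* unreachable if the error handler is honored */
    return 0;
}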
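
Wesley's description of the fix, removing children from the odls children list without holding the mutex, is a classic unlocked-list race. A generic sketch of that pattern follows; the types and names (child_t, children, children_lock, remove_child) are hypothetical, not the actual ORTE errmgr/odls symbols.

/*
 * Generic sketch of the race: one thread walks the children list while
 * another unlinks from it. All names here are hypothetical.
 */
#include <pthread.h>
#include <stdlib.h>

typedef struct child {
    struct child *next;
    int pid;
} child_t;

static child_t *children = NULL;                      /* list head */
static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;

/* Buggy shape: unlinking with no lock held. A concurrent walker in
 * another thread can dereference the node mid-unlink or after free. */
static void remove_child_racy(int pid)
{
    child_t **cur = &children;
    while (*cur != NULL && (*cur)->pid != pid)
        cur = &(*cur)->next;
    if (*cur != NULL) {
        child_t *dead = *cur;
        *cur = dead->next;                            /* unlink */
        free(dead);
    }
}

/* Fixed shape: every traversal and removal holds the same mutex, so
 * no thread ever observes the list half-modified. */
static void remove_child(int pid)
{
    pthread_mutex_lock(&children_lock);
    child_t **cur = &children;
    while (*cur != NULL && (*cur)->pid != pid)
        cur = &(*cur)->next;
    if (*cur != NULL) {
        child_t *dead = *cur;
        *cur = dead->next;
        free(dead);
    }
    pthread_mutex_unlock(&children_lock);
}

The unlocked variant only fails when the walker and the remover actually interleave, which would explain why the test passes on some runs and fails on others, exactly the intermittent behavior Terry reports.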