On Aug 18, 2011, at 14:58 , TERRY DONTJE wrote:

> On 8/18/2011 2:32 PM, George Bosilca wrote:
>> Terry,
>>
>> The test succeeded in both of your runs.
>>
> Not really. Granted, the test aborted in both cases; however, the case you
> show below has further issues while orte is trying to clean things up.
> It certainly is not what I would call friendly. But that is beside the
> point: the issue is that orte is having problems with the
> MPI_Errhandler_fatal_c test, IMO, and it looks like you have seen the same
> failure prior to the epoch changes. Fair enough, I'll go back to the
> drawing board and see if I can narrow this down.
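For context: judging from the output quoted further down, MPI_Errhandler_fatal_c deliberately triggers the fatal error handler and expects the whole job to abort. Below is a minimal sketch of that kind of test, in case it helps reproduce the hang; it is an illustration only, not the actual Intel test source, and the printed messages are made up.

/* Minimal sketch of a fatal-errhandler test (not the actual Intel test
 * source): send to an out-of-range rank on a dup'ed communicator with
 * MPI_ERRORS_ARE_FATAL, print the results message, and expect the
 * library to abort the whole job.  Run with e.g. mpirun -np 4. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 42;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);  /* shows up as "DUP FROM 0" in the error report */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_ARE_FATAL);
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (0 == rank) {
        printf("This test will abort after printing the results message\n");
    }

    /* Invalid destination rank => MPI_ERR_RANK; with MPI_ERRORS_ARE_FATAL
     * the library must abort the job here rather than return the error code. */
    MPI_Send(&token, 1, MPI_INT, size, 0, comm);

    /* Should never be reached. */
    printf("ERROR: rank %d survived the fatal error handler\n", rank);
    MPI_Finalize();
    return 1;
}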
In fact I see two different behaviors. With version r24814, the test deadlocks
as soon as the processes are rolled out over multiple nodes (as soon as orted
gets involved). In some cases we do get the spurious readv failures. With
today's version, mpirun completes successfully independent of the process
placement, even if the spurious readv failures can still be seen in some of
the runs.

  george.

> --td
>
>> However, I rolled back before the epoch change (r24814) and the output is
>> the following:
>>
>> MPITEST info (0): Starting MPI_Errhandler_fatal test
>> MPITEST info (0): This test will abort after printing the results message
>> MPITEST info (0): If it does not, then a f.a.i.l.u.r.e will be noted
>> [dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
>> [dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
>> [dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
>> [dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
>> [dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [dancer.eecs.utk.edu:16098] *** and potentially your MPI job)
>> MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
>> [dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
>> [dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>> As you can see, it is identical to the output in your test.
>>
>> george.
>>
>>
>> On Aug 18, 2011, at 12:29 , TERRY DONTJE wrote:
>>
>>> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails. Everything
>>> is the same except I don't see the "readv failed.." message.
>>>
>>> Have you tried to run this code yourself? It is pretty simple and fails
>>> with one node using np=4.
>>>
>>> --td
>>>
>>> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>>>
>>>> I just checked in a fix (I hope). I think the problem was that the errmgr
>>>> was removing children from the list of odls children without using the
>>>> mutex to prevent race conditions. Let me know if the MTT is still having
>>>> problems tomorrow.
>>>>
>>>> Wes
>>>>
>>>>> I am seeing the intel test suite tests MPI_Errhandler_fatal_c and
>>>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit. I have not
>>>>> seen these tests failing under MTT until the epoch code was added, so I
>>>>> have a suspicion the epoch code might be at fault. Could someone
>>>>> familiar with the epoch changes (Wesley) take a look at this failure?
>>>>>
>>>>> Note this fails intermittently, but it fails for me more times than not.
>>>>> Attached is a log file of a run that succeeds followed by the failing
>>>>> run. The pieces of concern are the messages involving
>>>>> mca_oob_tcp_msg_recv and below.
>>>>>
>>>>> thanks,
>>>>>
>>>>> --
>>>>> Terry D. Dontje | Principal Software Engineer
>>>>> Developer Tools Engineering | +1.781.442.2631
>>>>> Oracle - Performance Technologies
>>>>> 95 Network Drive, Burlington, MA 01803
>>>>> Email terry.don...@oracle.com
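Regarding the race Wesley mentions above (the errmgr removing children from the odls child list without taking the mutex), the pattern at issue is simply that every access to the shared child list must hold the same lock. A generic sketch of that pattern follows; the names (child_t, child_list, child_list_lock) are invented for the illustration and are not the actual ORTE errmgr/odls code.

/* Generic illustration of removing an element from a shared child list
 * under a mutex.  child_t, child_list, and child_list_lock are hypothetical
 * names for this sketch, not the real ORTE structures. */
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

typedef struct child {
    struct child *next;
    pid_t         pid;
} child_t;

static child_t        *child_list;                 /* shared across threads */
static pthread_mutex_t child_list_lock = PTHREAD_MUTEX_INITIALIZER;

void remove_child(pid_t pid)
{
    /* Without this lock, a thread iterating the list concurrently can
     * follow a pointer into freed memory, which is the kind of race
     * described above.  Every reader and writer of child_list must take
     * the same lock. */
    pthread_mutex_lock(&child_list_lock);
    for (child_t **cur = &child_list; *cur != NULL; cur = &(*cur)->next) {
        if ((*cur)->pid == pid) {
            child_t *dead = *cur;
            *cur = dead->next;   /* unlink */
            free(dead);
            break;
        }
    }
    pthread_mutex_unlock(&child_list_lock);
}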