Terry,

The test succeeded in both of your runs.

However, I rolled back to before the epoch change (r24814) and the output is the
following:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dancer.eecs.utk.edu:16098] ***    and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

As you can see, it is identical to the output in your test.
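
For anyone following along, the test itself presumably boils down to something
like the sketch below.  This is only a guess reconstructed from the output
above, not the actual Intel test source: it dups MPI_COMM_WORLD, keeps
MPI_ERRORS_ARE_FATAL on the dup'ed communicator, prints the results message,
and then sends to an out-of-range rank so the fatal error handler aborts the
job.

/* Rough sketch of what the MPI_Errhandler_fatal test appears to do,
 * reconstructed from the output above -- NOT the actual Intel test source.
 * Run with something like: mpirun -np 4 ./MPI_Errhandler_fatal_c
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);   /* shows up as "DUP FROM 0" */
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* MPI_ERRORS_ARE_FATAL is the default; set it explicitly anyway. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_ARE_FATAL);

    if (0 == rank) {
        printf("This test will abort after printing the results message\n");
        printf("MPI_Errhandler_fatal all tests PASSED\n");
    }

    /* Send to an out-of-range rank: raises MPI_ERR_RANK, and with the
     * fatal error handler installed the whole job aborts here. */
    MPI_Send(&buf, 1, MPI_INT, size + 1, 0, comm);

    /* Never reached. */
    MPI_Finalize();
    return 0;
}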

  george.


On Aug 18, 2011, at 12:29, TERRY DONTJE wrote:

> Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.  Everything 
> is the same except I don't see the "readv failed.." message.
> 
> Have you tried to run this code yourself?  It is pretty simple and fails 
> with one node using np=4.
> 
> --td
> 
> On 8/18/2011 10:57 AM, Wesley Bland wrote:
>> I just checked in a fix (I hope). I think the problem was that the errmgr
>> was removing children from the list of odls children without using the
>> mutex to prevent race conditions. Let me know if the MTT is still having
>> problems tomorrow.
>> 
>> Wes
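
For the archives, the race Wes describes boils down to the pattern below.
This is only a generic pthread illustration with made-up names, not the actual
ORTE errmgr/odls code or its locking macros; the point is simply that removals
from the shared children list have to happen while holding the same mutex the
other threads use to walk that list.

/* Generic illustration of a mutex-protected removal from a shared list.
 * NOT the actual ORTE errmgr/odls code -- names and types are invented. */
#include <pthread.h>
#include <stdlib.h>

struct child {
    struct child *next;
    int pid;
};

static struct child *children = NULL;                  /* shared list */
static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;

void remove_child(int pid)
{
    /* Without this lock, another thread iterating the list can follow a
     * pointer into a node that is being unlinked and freed -- the race. */
    pthread_mutex_lock(&children_lock);
    for (struct child **cur = &children; *cur != NULL; cur = &(*cur)->next) {
        if ((*cur)->pid == pid) {
            struct child *dead = *cur;
            *cur = dead->next;                         /* unlink */
            free(dead);
            break;
        }
    }
    pthread_mutex_unlock(&children_lock);
}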
>> 
>> 
>>> I am seeing the Intel test suite tests MPI_Errhandler_fatal_c and
>>> MPI_Errhandler_fatal_f fail with an oob failure quite a bit.  I have not
>>> seen these tests failing under MTT until the epoch code was added, so I
>>> have a suspicion the epoch code might be at fault.  Could someone
>>> familiar with the epoch changes (Wesley) take a look at this failure?
>>> 
>>> Note this fails intermittently, but it fails for me more often than not.
>>> Attached is a log file of a run that succeeds followed by the failing
>>> run.  The messages of concern are those involving mca_oob_tcp_msg_recv
>>> and below.
>>> 
>>> thanks,
>>> 
>>> --
>>> Oracle
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle - Performance Technologies
>>> 95 Network Drive, Burlington, MA 01803
>>> Email terry.don...@oracle.com
>>> 
> 
> -- 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
> 
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

