Thought I'd throw this out there: I retraced my MTT steps and did find failures of this test going back to r24774. r24775 has a commit comment that looks very relevant, and I am talking to the committer of that change now.

Sorry for the false accusation.

--td

On 8/18/2011 2:32 PM, George Bosilca wrote:
Terry,

The test succeeded in both of your runs.

However, I rolled back to before the epoch change (r24814) and the output is the following:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[dancer.eecs.utk.edu:16098] *** An error occurred in MPI_Send
[dancer.eecs.utk.edu:16098] *** reported by process [766095392769,139869904961537]
[dancer.eecs.utk.edu:16098] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[dancer.eecs.utk.edu:16098] *** MPI_ERR_RANK: invalid rank
[dancer.eecs.utk.edu:16098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dancer.eecs.utk.edu:16098] ***    and potentially your MPI job)
MPITEST_results: MPI_Errhandler_fatal all tests PASSED (4)
[dancer.eecs.utk.edu:16096] [[24280,0],0]-[[24280,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[dancer.eecs.utk.edu:16096] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[dancer.eecs.utk.edu:16096] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

As you can see, it is identical to the output in your test.
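To see the help messages that aggregation suppresses, the run can be repeated with the MCA parameter the last line of the output suggests; assuming the test binary is named MPI_Errhandler_fatal_c as in this thread, something like:

    mpirun -np 4 --mca orte_base_help_aggregate 0 ./MPI_Errhandler_fatal_c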

   george.


On Aug 18, 2011, at 12:29, Terry Dontje wrote:

Just ran MPI_Errhandler_fatal_c with r25063 and it still fails.  Everything is the same 
except I don't see the "readv failed.." message.

Have you tried to run this code yourself?  It is pretty simple and fails on a
single node using np=4.
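For anyone without the intel test suite sources, a minimal sketch of the pattern the output above suggests, assuming it boils down to sending to an out-of-range rank on a dup of MPI_COMM_WORLD with the default MPI_ERRORS_ARE_FATAL handler (this is not the actual test source, and the file and binary names below are made up), would be roughly:

    /*
     * Hypothetical reproducer sketch, not the intel test suite source.
     * Build:  mpicc errhandler_fatal_repro.c -o errhandler_fatal_repro
     * Run:    mpirun -np 4 ./errhandler_fatal_repro
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 42;
        MPI_Comm dup_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* The output above names "MPI COMMUNICATOR 3 DUP FROM 0", so use a dup. */
        MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

        if (rank == 0) {
            printf("sending to out-of-range rank %d, expecting a fatal MPI_ERR_RANK\n", size);
            fflush(stdout);
            /* rank == size is invalid; the default MPI_ERRORS_ARE_FATAL
             * handler should abort the job here. */
            MPI_Send(&token, 1, MPI_INT, size, 0, dup_comm);
        }

        /* Only reached if the error handler did not abort. */
        MPI_Comm_free(&dup_comm);
        MPI_Finalize();
        return 0;
    }

Launched with np=4 on one node, rank 0 should hit the fatal handler while the runtime tears down the remaining ranks, which is the abort-and-cleanup path being discussed in this thread.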

--td

On 8/18/2011 10:57 AM, Wesley Bland wrote:
I just checked in a fix (I hope). I think the problem was that the errmgr
was removing children from the list of odls children without holding the
mutex that protects that list against race conditions. Let me know if MTT
is still having problems tomorrow.
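A purely illustrative sketch of that kind of fix (plain C with pthreads, not the actual Open MPI errmgr/odls code or its data structures): the point is simply that unlinking an entry from the shared child list happens only while the same lock its other users take is held.

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical child list, NOT Open MPI's odls structures. */
    typedef struct child {
        struct child *next;
        int           pid;
    } child_t;

    static child_t        *children      = NULL;   /* shared list head */
    static pthread_mutex_t children_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Add a child to the shared list under the lock. */
    static void add_child(int pid)
    {
        child_t *c = malloc(sizeof(*c));
        c->pid = pid;
        pthread_mutex_lock(&children_lock);
        c->next  = children;
        children = c;
        pthread_mutex_unlock(&children_lock);
    }

    /* Remove a child: the unlink happens only while the lock is held,
     * so a concurrent traversal in another thread cannot race with it. */
    static void remove_child(int pid)
    {
        pthread_mutex_lock(&children_lock);
        for (child_t **cur = &children; *cur != NULL; cur = &(*cur)->next) {
            if ((*cur)->pid == pid) {
                child_t *dead = *cur;
                *cur = dead->next;
                free(dead);
                break;
            }
        }
        pthread_mutex_unlock(&children_lock);
    }

    int main(void)
    {
        add_child(100);
        add_child(101);
        remove_child(100);   /* safe even if other threads walk the list */
        return 0;
    }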

Wes


I am seeing the intel test suite tests MPI_Errhandler_fatal_c and
MPI_Errhandler_fatal_f fail with an OOB failure quite a bit. I had not
seen these tests failing under MTT until the epoch code was added, so I
suspect the epoch code might be at fault. Could someone familiar with the
epoch changes (Wesley) take a look at this failure?

Note that this fails intermittently, but it fails for me more often than not.
Attached is a log file of a run that succeeds, followed by the failing run.
The messages of concern are the ones involving mca_oob_tcp_msg_recv and
below.

thanks,







