I'm getting a pile of test failures when running with the openib and tcp BTLs on the trunk. Gleb is getting some failures, too, but his seem to be different from mine.

Here's what I'm seeing from manual MTT runs on my SVN/development install -- did you know that MTT could do that? :-)

+-------------+-------------------+------+------+----------+------+
| Phase       | Section           | Pass | Fail | Time out | Skip |
+-------------+-------------------+------+------+----------+------+
| Test Run    | intel             | 442  | 0    | 26       | 0    |
| Test Run    | ibm               | 173  | 3    | 1        | 3    |
+-------------+-------------------+------+------+----------+------+

The tests that are failing are:

*** WARNING: Test: MPI_Recv_pack_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Ssend_ator_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Irecv_pack_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Isend_ator_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Irsend_rtoa_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Ssend_rtoa_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Send_rtoa_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Send_ator_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Rsend_rtoa_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Reduce_loc_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Isend_ator2_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Issend_rtoa_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Isend_rtoa_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Send_ator2_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: MPI_Issend_ator_c, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: comm_join, np=16, variant=1: TIMED OUT (failed)
*** WARNING: Test: getcount, np=16, variant=1: FAILED
*** WARNING: Test: spawn, np=3, variant=1: FAILED
*** WARNING: Test: spawn_multiple, np=3, variant=1: FAILED

I'm not too worried about the comm spawn/join tests because I think they're heavily oversubscribing the nodes and therefore timing out. These were all from a default trunk build running with "mpirun --mca btl openib,self".
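
If you want to try one of the failing cases by hand, something along these lines should do it (the hostfile and test binary path below are just placeholders -- point them at your own host list and wherever your intel suite build lives; the --mca argument is the same one quoted above):

    # placeholders: adjust the hostfile and test binary path for your setup
    mpirun --mca btl openib,self -np 16 --hostfile ./my_hosts ./MPI_Send_rtoa_c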

For all of these tests, I'm running on 4 nodes, 4 cores each, but they have varying numbers of network interfaces:

          nodes 1,2          nodes 3,4
openib    3 active ports     2 active ports
tcp       4 tcp interfaces   3 tcp interfaces

Is anyone else seeing these kinds of failures?

--
Jeff Squyres
Cisco Systems
