Ralph,

I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to 
HEAD):

----8<----
MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
--------------------------------------------------------------------------
The user-provided time limit for job execution has been
reached:

  MPIEXEC_TIMEOUT: 8 seconds

The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified
by MPIEXEC_TIMEOUT in your environment).

--------------------------------------------------------------------------
srun: error: mpi015: task 0: Killed
srun: Terminating job step 689585.2
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] 
mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 
16]
[savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] 
mca_oob_tcp_peer_send_handler: unable to send header

^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate

^C
----8<----

Each "^C" above is a ctrl-c that I typed after letting an arbitrary amount of 
time pass (several minutes before each of the first two, <5s before the third).

Here "sleeper" is just an MPI program that does:

----8<----
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    /* sleep forever, so mpirun has to kill us when the timeout fires */
    while (1) {
        sleep(60);
    }

    /* never reached */
    MPI_Finalize();
----8<----

The hang happens when launching under both slurm and SSH.  If I launch on 
localhost (no --host/--hostfile option, no slurm, etc.) then mpirun exits just 
fine.  The example output I gave above used the "usnic" BTL, but "tcp" has 
identical behavior.

This worked fine in v1.7.4.  I've bisected the change in behavior down to 
r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
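
For anyone who wants to double-check the bisection, the procedure is an ordinary binary search over revision numbers, roughly along the lines of the sketch below. Note the assumptions: shows_hang() is a hypothetical stand-in for "build that revision and run the reproducer above" and simply hard-codes the result reported here, and both function names are made up for illustration.

```c
/* Hypothetical stand-in for "build this revision and run the reproducer".
 * It only encodes the result reported above: r30981 and later hang. */
static int shows_hang(int rev)
{
    return rev >= 30981;
}

/* Return the first revision in (good, bad] for which shows_hang() is true.
 * Requires shows_hang(good) == 0 and shows_hang(bad) == 1. */
static int first_bad_revision(int good, int bad)
{
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (shows_hang(mid)) {
            bad = mid;   /* regression was introduced at or before mid */
        } else {
            good = mid;  /* regression was introduced after mid */
        }
    }
    return bad;
}
```

With a known-good lower bound (say 30900, which is an assumed placeholder and not a real tag) and r31103 as the known-bad upper bound, first_bad_revision(30900, 31103) needs about eight build-and-run steps to converge.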

Should I file a ticket?

-Dave
