This seems to be working, but I think we now have a process group problem -- we need to call setpgid() right after the fork. Otherwise, when we kill the group, we might end up killing much more than just the one MPI process (including the orted and/or the orted's parent!).
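For illustration only, here is a minimal sketch of the fork + setpgid() idea described above. It is not the actual ORTE code: the helper names spawn_in_own_group() and kill_group() are hypothetical, and the sketch assumes a plain POSIX fork/exec launch. The child makes itself the leader of a new process group before exec, so that signaling the group reaches only that process and its descendants, not the launcher or the launcher's parent.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that leads its own process group,
 * then exec argv[0] in it. Returns the child's pid (== its pgid), or -1. */
static pid_t spawn_in_own_group(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;                  /* fork failed */
    }
    if (pid == 0) {
        /* Child: leave the parent's group right after the fork. */
        if (setpgid(0, 0) != 0) {
            _exit(127);
        }
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }
    /* Parent: make the same call so the group exists before we could
     * possibly signal it. This is best effort: it fails with EACCES if
     * the child has already exec'd, in which case the child has already
     * called setpgid() itself. */
    (void)setpgid(pid, pid);
    return pid;
}

/* Hypothetical helper: signal every process in the given group. */
static int kill_group(pid_t pgid, int sig)
{
    /* Guard against signaling our own group (0) or everything (1). */
    assert(pgid > 1);
    return kill(-pgid, sig);        /* negative pid targets the group */
}
```

With this arrangement, something like kill_group(pid, SIGKILL) followed by waitpid(pid, &status, 0) should take down only the launched process and anything it forked. Without the setpgid() calls, the child stays in the launcher's group, and killing that group would hit the launcher too.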
Ping me on IM -- I'm testing this idea and it seems to work properly.

On Mar 18, 2014, at 4:11 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, fixed and cmr'd to you
>
>
> On Mar 18, 2014, at 11:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>>
>> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) <dgood...@cisco.com> wrote:
>>
>>> Ralph,
>>>
>>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to HEAD):
>>>
>>> ----8<----
>>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>>> --------------------------------------------------------------------------
>>> The user-provided time limit for job execution has been
>>> reached:
>>>
>>> MPIEXEC_TIMEOUT: 8 seconds
>>>
>>> The job will now be aborted. Please check your code and/or
>>> adjust/remove the job execution time limit (as specified
>>> by MPIEXEC_TIMEOUT in your environment).
>>>
>>> --------------------------------------------------------------------------
>>> srun: error: mpi015: task 0: Killed
>>> srun: Terminating job step 689585.2
>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
>>> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>>>
>>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>>>
>>> ^C
>>> ----8<----
>>>
>>> Where each "^C" is a ctrl-c, with an arbitrary amount of time allowed to pass
>>> beforehand (several minutes for the first two, <5s for the third).
>>>
>>> Where "sleeper" is just an MPI program that does:
>>>
>>> ----8<----
>>> MPI_Init(&argc, &argv);
>>> MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
>>> MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>>>
>>> while (1) {
>>>     sleep(60);
>>> }
>>>
>>> MPI_Finalize();
>>> ----8<----
>>>
>>> It happens under slurm and SSH. If I launch on localhost (no
>>> --host/--hostfile option, no slurm, etc.) then it exits just fine. The
>>> example output I gave above used the "usnic" BTL, but "tcp" has identical
>>> behavior.
>>>
>>> This worked fine in v1.7.4. I've bisected the change in behavior down to
>>> r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>>>
>>> Should I file a ticket?
>>>
>>
>> Crud - no, I'll take a look in a little bit
>>
>>
>>> -Dave
>>>
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/03/14367.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/