Hey Mark

Can you do me a favor? I’m totally buried, but I have been able to replicate this on my machine, so it is definitely a race condition.
What would really help me is if you could do the following:

* start the orte-dvm with the "--mca state_base_verbose 10" option, and capture stdout/stderr in a file

* run your test program - I don’t think you need all 40 submits, so you might try a smaller number just to make life easier

* reorganize the output, grouping the resulting state machine debug output according to the job. You’ll see output like this:

  [bend001:22303] [[55873,0],0] state:base:check_job_completed job [55873,2] is terminated (1 vs 1 [NORMALLY TERMINATED])
  [bend001:22303] [[55873,0],0] ACTIVATE JOB [55873,32] STATE NOTIFY COMPLETED AT state_dvm.c:415

  etc.

If you could collect all the state output for each job, it would really help me to identify the last state each job reached. I could then see what state the jobs that aren’t being properly marked as terminated finished in.

I hate to ask it of you - I just don’t have time to do all that sorting right now. If you’d prefer to decline, feel free to do so and I’ll attack this when I next have a chance.

Ralph

> On Oct 15, 2015, at 3:50 PM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
>
>> On 16 Oct 2015, at 0:44 , Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmmm....ok. I'll have to look at it this weekend when I return from travel.
>> Can you please send me your test program so I can try to locally reproduce
>> it?
>
> Ok, thanks Ralph.
>
> Start the DVM with: orte-dvm --report-uri dvm_uri --debug-devel
>
> And then run the following script. The "serial /bin/date" and the "parallel
> sleep 1" are fine. The "parallel /bin/date" shows the hanging.
>
> #!/bin/sh
>
> for i in $(seq 42)
> do
>     # GOOD
>     #orte-submit --hnp file:dvm_uri -np 1 /bin/date
>     #orte-submit --hnp file:dvm_uri -np 1 /bin/sleep 1 &
>
>     # BAD
>     orte-submit --hnp file:dvm_uri -np 1 /bin/date &
> done
> wait
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18188.php
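
P.S. In case it saves some time on the sorting step requested above, here is a rough sketch of how the captured output could be grouped per job. It is only an illustration, not a tested tool: it assumes every relevant line names its job as "job [F,N]" or "JOB [F,N]", as in the two sample lines, and it silently skips any line that does not. The function name group_by_job and the file name dvm.log are made up for the example.

```shell
#!/bin/sh
# Group state-machine debug lines by job ID and report the last line seen
# for each job. Jobs are listed in order of first appearance.
# Assumption: a job is identified by "job [F,N]" or "JOB [F,N]" in the line;
# lines without such a marker are ignored.
group_by_job() {
    awk '
        match($0, /[Jj][Oo][Bb] \[[0-9]+,[0-9]+\]/) {
            # Drop the leading "job " (4 characters) to keep just "[F,N]".
            id = substr($0, RSTART + 4, RLENGTH - 4)
            if (!(id in lines)) order[++n] = id
            lines[id] = lines[id] $0 "\n"
            last[id] = $0
        }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                printf "=== job %s ===\n%s", id, lines[id]
                printf "last state line: %s\n\n", last[id]
            }
        }
    ' "$@"
}
```

With the DVM output captured in dvm.log, something like "group_by_job dvm.log > by_job.txt" would then write one block per job, each ending with the last state line recorded for that job.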