Hey Mark

Can you do me a favor? I’m totally buried, but I have been able to replicate 
this on my machine, so it is a definite race condition.

What would really help me is if you could do the following:

* start the orte-dvm with the “--mca state_base_verbose 10” option, and capture 
stdout/stderr in a file

* run your test program - I don’t think you need all 40 submits, so you might 
try a smaller number just to make life easier

* reorganize the output, grouping the resulting state machine debug output 
by job. You’ll see output like this:

[bend001:22303] [[55873,0],0] state:base:check_job_completed job [55873,2] is 
terminated (1 vs 1 [NORMALLY TERMINATED])

[bend001:22303] [[55873,0],0] ACTIVATE JOB [55873,32] STATE NOTIFY COMPLETED AT 
state_dvm.c:415

etc. If you could collect all the state output for each job, it would really 
help me to identify the last state each job reached. I could then see what 
state the jobs that aren’t being properly marked as terminated finished in.
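
A minimal sketch of that sorting, assuming the captured log has one unwrapped 
message per line and that each relevant line names its job as "job [X,Y]" or 
"JOB [X,Y]", as in the two samples above. The function name group_by_job and 
the file names dvm.log and by_job.txt are hypothetical; lines in any other 
format are simply skipped.

```shell
#!/bin/sh
# Group state-machine debug lines by job ID and print each job's last line.
# Assumption: unwrapped log lines containing "job [X,Y]" or "JOB [X,Y]".
# Reads the files given as arguments, or stdin if none are given.

group_by_job() {
    awk '
    match($0, /(job|JOB) \[[0-9]+,[0-9]+\]/) {
        # Skip the 4 characters of "job " to keep only "[X,Y]".
        id = substr($0, RSTART + 4, RLENGTH - 4)
        if (!(id in lines)) order[++n] = id   # remember first-seen order
        lines[id] = lines[id] $0 "\n"         # append line to its job
        last[id] = $0                         # most recent line for the job
    }
    END {
        for (i = 1; i <= n; i++) {
            id = order[i]
            printf "=== job %s ===\n%s", id, lines[id]
            printf "last: %s\n\n", last[id]
        }
    }' "$@"
}
```

With the DVM output captured as, say, 
"orte-dvm --report-uri dvm_uri --mca state_base_verbose 10 > dvm.log 2>&1", 
running "group_by_job dvm.log > by_job.txt" would give one block per job, each 
ending in a "last:" line showing the final state message seen for that job.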

I hate to ask it of you - I just don’t have time to do all that sorting right 
now. If you’d prefer to decline, feel free to do so and I’ll attack this when I 
next have a chance.

Ralph


> On Oct 15, 2015, at 3:50 PM, Mark Santcroos <mark.santcr...@rutgers.edu> 
> wrote:
> 
> 
>> On 16 Oct 2015, at 0:44 , Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> Hmmm....ok. I'll have to look at it this weekend when I return from travel. 
>> Can you please send me your test program so I can try to locally reproduce 
>> it?
> 
> Ok, thanks Ralph.
> 
> 
> Start the DVM with: orte-dvm --report-uri dvm_uri --debug-devel
> 
> And then run the following script. The "serial /bin/date" and the "parallel 
> sleep 1" are fine. The "parallel /bin/date" shows the hanging.
> 
> 
> #!/bin/sh
> 
> for i in $(seq 42)
> do
>    # GOOD
>    #orte-submit --hnp file:dvm_uri -np 1 /bin/date
>    #orte-submit --hnp file:dvm_uri -np 1 /bin/sleep 1 &
> 
>    # BAD
>    orte-submit --hnp file:dvm_uri -np 1 /bin/date &
> done
> wait
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/10/18188.php
