Hi Ralph,
> On 15 Oct 2015, at 0:26 , Ralph Castain <[email protected]> wrote:
> Okay, so each orte-submit is reporting job has launched, which means the hang
> is coming while waiting to hear the job completed. Are you sure that orte-dvm
> believes the job has completed?
No, I'm not.
> In other words, when you say that you observe the job as completing, are you
> basing that on some output from orte-dvm, or because the procs have exited,
> or...?
... because the tasks have created their output.
> I can send you a patch tonight that would cause orte-dvm to emit a "job
> completed" message when it determines each job has terminated - might help us
> take the next step.
Great.
> I'm wondering if orte-dvm thinks the job is still running, and the race
> condition is in that area (as opposed to being in orte-submit itself)
Do these counts from the orte-dvm output provide any hints?
$ grep "Releasing job data.*INVALID" dvm_output.txt |wc -l
42
$ grep "ORTE_DAEMON_SPAWN_JOB_CMD" dvm_output.txt |wc -l
42
$ grep "ORTE_DAEMON_ADD_LOCAL_PROCS" dvm_output.txt |wc -l
42
$ grep "sess_dir_finalize" dvm_output.txt |wc -l
35
In other words, the "[netbook:XXXX] sess_dir_finalize: proc session dir does
not exist" message shows up only 35 times for 42 launched jobs. It doesn't
show up for the hanging ones, which could support your suspicion that
orte-dvm is at fault.
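For repeated runs, the same comparison could be scripted. This is only a
minimal sketch: the function name dvm_counts is made up for illustration, and
it assumes that every job that terminates cleanly logs exactly one
sess_dir_finalize line, which is an assumption and has not been confirmed in
this thread.

```shell
# Compare job launches against session-dir cleanups in an orte-dvm log.
# ASSUMPTION: one "sess_dir_finalize" line per cleanly terminated job.
dvm_counts() {
    log="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    spawned=$(grep -c "ORTE_DAEMON_SPAWN_JOB_CMD" "$log")
    finalized=$(grep -c "sess_dir_finalize" "$log")
    # The difference is the number of jobs with no cleanup message.
    echo "spawned=$spawned finalized=$finalized missing=$((spawned - finalized))"
}

# Usage: dvm_counts dvm_output.txt
```

With the counts shown above this would report spawned=42 finalized=35
missing=7.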
Regards,
Mark