Starting a few days ago, I have noticed that more and more orted
processes are left over after my runs. Usually, if the job runs to
completion, they disappear. But if I kill the job or it segfaults,
they don't. I attached to one of them and got the following stack:
#0 0x9001f7a8 in select ()
#1 0x00375d34 in select_dispatch (arg=0x39ec6c, tv=0xbfffe664) at ../../../ompi-trunk/opal/event/select.c:202
#2 0x00373b70 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:485
#3 0x00237ee0 in orte_iof_base_flush () at ../../../../ompi-trunk/orte/mca/iof/base/iof_base_flush.c:111
#4 0x004cbb38 in orte_pls_fork_wait_proc (pid=9045, status=9, cbdata=0x50c250) at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:175
#5 0x002111f0 in do_waitall (options=0) at ../../ompi-trunk/orte/runtime/orte_wait.c:500
#6 0x00210ac8 in orte_wait_signal_callback (fd=20, event=8, arg=0x26f3f8) at ../../ompi-trunk/orte/runtime/orte_wait.c:366
#7 0x003737f8 in opal_event_process_active () at ../../../ompi-trunk/opal/event/event.c:428
#8 0x00373ce8 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:513
#9 0x00368714 in opal_progress () at ../../ompi-trunk/opal/runtime/opal_progress.c:259
#10 0x004cdf48 in opal_condition_wait (c=0x4cf0f0, m=0x4cf0b0) at ../../../../../ompi-trunk/opal/threads/condition.h:81
#11 0x004cde60 in orte_pls_fork_finalize () at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:764
#12 0x002417d0 in orte_pls_base_finalize () at ../../../../ompi-trunk/orte/mca/pls/base/pls_base_close.c:42
#13 0x000ddf58 in orte_rmgr_urm_finalize () at ../../../../../ompi-trunk/orte/mca/rmgr/urm/rmgr_urm.c:521
#14 0x00254ec0 in orte_rmgr_base_close () at ../../../../ompi-trunk/orte/mca/rmgr/base/rmgr_base_close.c:39
#15 0x0020e574 in orte_system_finalize () at ../../ompi-trunk/orte/runtime/orte_system_finalize.c:65
#16 0x0020899c in orte_finalize () at ../../ompi-trunk/orte/runtime/orte_finalize.c:42
#17 0x00002ac8 in main (argc=19, argv=0xbffff17c) at ../../../../ompi-trunk/orte/tools/orted/orted.c:377
Somehow, it is waiting for pid 9045. But this was one of the
children, and it received the SIGKILL signal (I checked with strace).
I wonder whether we have a race condition somewhere in the
wait_signal code.
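
As a side note, the status=9 passed to orte_pls_fork_wait_proc in
frame #4 is consistent with the child having been killed by SIGKILL:
on the usual Unix encodings of the wait status, a raw value of 9
decodes as "terminated by signal 9". The sketch below is not ORTE
code; the helper names terminating_signal and spawn_kill_and_reap are
made up for illustration. It forks a child, kills it, reaps it with
waitpid() and decodes the status with the standard macros.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: decode a raw wait status.
 * Returns the terminating signal number, 0 if the child exited
 * normally, or -1 for anything else (e.g. stopped). */
static int terminating_signal(int status)
{
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);
    }
    if (WIFEXITED(status)) {
        return 0;
    }
    return -1;
}

/* Hypothetical helper: fork a child that just blocks, send it `sig`,
 * reap it, and return the raw wait status (or -1 on error). */
static int spawn_kill_and_reap(int sig)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: sleep until a signal terminates us. */
        for (;;) {
            pause();
        }
    }

    if (kill(pid, sig) != 0) {
        return -1;
    }

    int status = 0;
    pid_t r;
    do {
        /* Retry if the blocking wait is interrupted by a signal. */
        r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);

    return (r == pid) ? status : -1;
}
```

Calling spawn_kill_and_reap(SIGKILL) and passing the result to
terminating_signal() should give SIGKILL back, which matches what the
trace shows for pid 9045: the child was reaped and its callback was
invoked, so the hang is in what happens after the reap.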
Hope that helps,
george.