Starting a few days ago, I have noticed that more and more orted processes are left over after my runs. Usually, if the job runs to completion, they disappear. But if I kill the job or it segfaults, they don't. I attached to one of them and got the following stack:

#0  0x9001f7a8 in select ()
#1  0x00375d34 in select_dispatch (arg=0x39ec6c, tv=0xbfffe664) at ../../../ompi-trunk/opal/event/select.c:202
#2  0x00373b70 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:485
#3  0x00237ee0 in orte_iof_base_flush () at ../../../../ompi-trunk/orte/mca/iof/base/iof_base_flush.c:111
#4  0x004cbb38 in orte_pls_fork_wait_proc (pid=9045, status=9, cbdata=0x50c250) at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:175
#5  0x002111f0 in do_waitall (options=0) at ../../ompi-trunk/orte/runtime/orte_wait.c:500
#6  0x00210ac8 in orte_wait_signal_callback (fd=20, event=8, arg=0x26f3f8) at ../../ompi-trunk/orte/runtime/orte_wait.c:366
#7  0x003737f8 in opal_event_process_active () at ../../../ompi-trunk/opal/event/event.c:428
#8  0x00373ce8 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:513
#9  0x00368714 in opal_progress () at ../../ompi-trunk/opal/runtime/opal_progress.c:259
#10 0x004cdf48 in opal_condition_wait (c=0x4cf0f0, m=0x4cf0b0) at ../../../../../ompi-trunk/opal/threads/condition.h:81
#11 0x004cde60 in orte_pls_fork_finalize () at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:764
#12 0x002417d0 in orte_pls_base_finalize () at ../../../../ompi-trunk/orte/mca/pls/base/pls_base_close.c:42
#13 0x000ddf58 in orte_rmgr_urm_finalize () at ../../../../../ompi-trunk/orte/mca/rmgr/urm/rmgr_urm.c:521
#14 0x00254ec0 in orte_rmgr_base_close () at ../../../../ompi-trunk/orte/mca/rmgr/base/rmgr_base_close.c:39
#15 0x0020e574 in orte_system_finalize () at ../../ompi-trunk/orte/runtime/orte_system_finalize.c:65
#16 0x0020899c in orte_finalize () at ../../ompi-trunk/orte/runtime/orte_finalize.c:42
#17 0x00002ac8 in main (argc=19, argv=0xbffff17c) at ../../../../ompi-trunk/orte/tools/orted/orted.c:377

Somehow, it is waiting for pid 9045. But this was one of its children, and it got the SIGKILL signal (I checked with strace). I wonder whether we have a race condition somewhere in the wait_signal code.
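
To make the status=9 in frame #4 concrete: on the usual Unix wait-status encoding, a raw status of 9 means the child was terminated by signal 9 (SIGKILL), which would match what strace showed. Below is a minimal standalone sketch in plain POSIX C, not Open MPI code; the function name reap_killed_child is made up for illustration. It forks a child, kills it with SIGKILL, reaps it with waitpid, and decodes the status with the standard macros.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only (not part of ORTE).
 * Forks a child, kills it with SIGKILL, reaps it, and returns the
 * number of the signal that terminated it, or -1 on any error or
 * if the child did not die from a signal. */
int reap_killed_child(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;              /* fork failed */
    }
    if (pid == 0) {
        for (;;) {
            pause();            /* child: sleep until it is killed */
        }
    }

    /* Parent: kill the child, then block until it has been reaped. */
    if (kill(pid, SIGKILL) != 0) {
        return -1;
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }

    /* Decode the raw wait status instead of reading it directly. */
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);
    }
    return -1;
}
```

Calling reap_killed_child() from a small main() should return SIGKILL. The point is only that the child in the trace looks already reaped with a "killed by signal" status, so the hang appears to be in the flush that follows, not in the wait itself.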

Hope that helps,
  george.
