Thanks for the fix.
-- Josh

On Feb 16, 2009, at 2:59 PM, George Bosilca wrote:

Josh,

Spending a few minutes to understand the problem would have pointed you to the real culprit: the tool itself!

The assert in the code states that on finalize there is still a registered signal handler. A quick gdb session shows that it is the one for SIGCHLD. Tracking the signal addition in the tool (a gdb breakpoint on opal_event_queue_insert) clearly highlights where this happens, i.e. orte_wait_init in orte/runtime/orte_wait.c:274. So far so good: we're right to track SIGCHLD, but we're not supposed to leave the handler there when we're done (as the signal is registered with the PERSISTENT option). Leaving ... ah, there is a function to cleanly unregister it, right next to orte_wait_init, with a very clear name: orte_wait_finalize. Wonderful, except that in the case of a tool it is never called. Strange, isn't it, that no other component in the ompi tree exhibits such behavior. Maybe grep can help ... There we are:

[bosilca@dancer ompi]$ find . -name "*.c" -exec grep -Hn orte_wait_finalize {} \;
./orte/mca/ess/hnp/ess_hnp_module.c:486:    orte_wait_finalize();
./orte/mca/ess/base/ess_base_std_app.c:222:    orte_wait_finalize();
./orte/mca/ess/base/ess_base_std_orted.c:310:    orte_wait_finalize();
./orte/runtime/orte_wait.c:280:orte_wait_finalize(void)
./orte/runtime/orte_wait.c:872:orte_wait_finalize(void)
./orte/runtime/orte_wait.c:1182:orte_wait_finalize(void)

This clearly shows that, with the exception of the tools, everybody else clears their state before leaving. And here we are: a quick patch that really fixes the problem without removing code that had a really good reason to be there.

Index: orte/mca/ess/base/ess_base_std_tool.c
===================================================================
--- orte/mca/ess/base/ess_base_std_tool.c       (revision 20564)
+++ orte/mca/ess/base/ess_base_std_tool.c       (working copy)
@@ -158,6 +158,8 @@

 int orte_ess_base_tool_finalize(void)
 {
+    orte_wait_finalize();
+
     /* if I am a tool, then all I will have done is
      * a very small subset of orte_init - ensure that
      * I only back those elements out


  george.


On Feb 16, 2009, at 12:57 , Josh Hursey wrote:

This commit seems to have broken the tools. If I use orte-ps then on finalize I get an abort() with the following stack:

shell$ orte-ps
...
(gdb) bt
#0  0x00002aaaabcee155 in raise () from /lib64/libc.so.6
#1  0x00002aaaabcefbf0 in abort () from /lib64/libc.so.6
#2  0x00002aaaabce75d6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaaaf734e1 in opal_evsignal_dealloc (base=0x609f50) at signal.c:295
#4  0x00002aaaaaf73f36 in poll_dealloc (base=0x609f50, arg=0x60a9a0) at poll.c:390
#5  0x00002aaaaaf70667 in opal_event_base_free (base=0x609f50) at event.c:530
#6  0x00002aaaaaf70519 in opal_event_fini () at event.c:390
#7  0x00002aaaaaf5f624 in opal_finalize () at runtime/opal_finalize.c:117
#8  0x00002aaaaacd4fc4 in orte_finalize () at runtime/orte_finalize.c:84
#9  0x000000000040196a in main (argc=1, argv=0x7fffffffdf38) at orte-ps.c:275

Any thoughts on why this is happening for only the tools case?

-- Josh

On Feb 14, 2009, at 4:51 PM, bosi...@osl.iu.edu wrote:

Author: bosilca
Date: 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
New Revision: 20562
URL: https://svn.open-mpi.org/trac/ompi/changeset/20562

Log:
Release the default base on finalize.

Text files modified:
 trunk/opal/event/event.c |     4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

Modified: trunk/opal/event/event.c
==============================================================================
--- trunk/opal/event/event.c    (original)
+++ trunk/opal/event/event.c 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
@@ -386,6 +386,10 @@
   if (NULL != opal_event_module_include) {
       opal_argv_free(opal_event_module_include);
   }
+    if( NULL != opal_current_base ) {
+        event_base_free(opal_current_base);
+        opal_current_base = NULL;
+    }
   return OPAL_SUCCESS;
}

_______________________________________________
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
