Never mind -- you just did. Thanks! :-)
On Feb 16, 2009, at 3:07 PM, Jeff Squyres wrote:
George --
Will you commit?
On Feb 16, 2009, at 2:59 PM, George Bosilca wrote:
Josh,
Spending a few minutes to understand the problem would have pointed
you to the real culprit: the tool itself!
The assert in the code states that on finalize there is still a
registered signal handler. A quick gdb session shows that it is for
SIGCHLD. Tracking the signal addition in the tool (a breakpoint in
gdb on opal_event_queue_insert) clearly highlights the place where
this happens, i.e. orte_wait_init in orte/runtime/orte_wait.c:274.
So far so good: we're right to track SIGCHLD, but we're not
supposed to leave it there when we're done (as the signal is
registered with the PERSISTENT option). Leaving ... ah, there is a
function to cleanly unregister it, right next to orte_wait_init,
with a very clear name: orte_wait_finalize. Wonderful, except that
in the case of a tool it is never called. Strange, isn't it, that
no other component in the ompi tree exhibits such behavior? Maybe
grep can help ... There we are:
[bosilca@dancer ompi]$ find . -name "*.c" -exec grep -Hn orte_wait_finalize {} \;
./orte/mca/ess/hnp/ess_hnp_module.c:486: orte_wait_finalize();
./orte/mca/ess/base/ess_base_std_app.c:222: orte_wait_finalize();
./orte/mca/ess/base/ess_base_std_orted.c:310: orte_wait_finalize();
./orte/runtime/orte_wait.c:280:orte_wait_finalize(void)
./orte/runtime/orte_wait.c:872:orte_wait_finalize(void)
./orte/runtime/orte_wait.c:1182:orte_wait_finalize(void)
This clearly shows that, with the exception of the tools, everybody
else clears their state before leaving. And here we are: a quick
patch that really fixes the problem without removing code that had a
really good reason to be there.
Index: orte/mca/ess/base/ess_base_std_tool.c
===================================================================
--- orte/mca/ess/base/ess_base_std_tool.c (revision 20564)
+++ orte/mca/ess/base/ess_base_std_tool.c (working copy)
@@ -158,6 +158,8 @@
int orte_ess_base_tool_finalize(void)
{
+ orte_wait_finalize();
+
/* if I am a tool, then all I will have done is
* a very small subset of orte_init - ensure that
* I only back those elements out
george.
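The init/finalize pairing described in the message above can be
sketched in a few lines of C. This is a toy model, not Open MPI code:
the toy_* names, the fixed-size array, and the return conventions are
all hypothetical. Where the real opal_evsignal_dealloc asserts, this
sketch only reports how many signal events are still registered.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for an event base that tracks the signal
 * events registered on it. Not the real opal/libevent structure. */
#define TOY_MAX_SIGNALS 8

struct toy_event_base {
    int signals[TOY_MAX_SIGNALS];
    size_t count;
};

/* Register a persistent signal event.
 * Returns 0 on success, -1 if the table is full. */
static int toy_signal_add(struct toy_event_base *base, int signo)
{
    assert(base != NULL);
    if (base->count >= TOY_MAX_SIGNALS) {
        return -1;
    }
    base->signals[base->count] = signo;
    base->count++;
    return 0;
}

/* Unregister a signal event.
 * Returns 0 on success, -1 if the signal was not registered. */
static int toy_signal_del(struct toy_event_base *base, int signo)
{
    assert(base != NULL);
    for (size_t i = 0; i < base->count; i++) {
        if (base->signals[i] == signo) {
            /* Fill the hole with the last entry, then shrink. */
            base->count--;
            base->signals[i] = base->signals[base->count];
            return 0;
        }
    }
    return -1;
}

/* Plays the role of orte_wait_init: start tracking SIGCHLD. */
static int toy_wait_init(struct toy_event_base *base)
{
    return toy_signal_add(base, SIGCHLD);
}

/* Plays the role of orte_wait_finalize: stop tracking SIGCHLD. */
static int toy_wait_finalize(struct toy_event_base *base)
{
    return toy_signal_del(base, SIGCHLD);
}

/* Number of signal events still registered at teardown. The real
 * code asserts on this condition instead of returning a count. */
static size_t toy_base_leaked(const struct toy_event_base *base)
{
    assert(base != NULL);
    return base->count;
}
```

A caller that runs toy_wait_init but never toy_wait_finalize leaves
toy_base_leaked at 1 when the base is torn down, which is the
situation the tools were in before the patch above.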
On Feb 16, 2009, at 12:57, Josh Hursey wrote:
This commit seems to have broken the tools. If I use orte-ps then
on finalize I get an abort() with the following stack:
shell$ orte-ps
...
(gdb) bt
#0 0x00002aaaabcee155 in raise () from /lib64/libc.so.6
#1 0x00002aaaabcefbf0 in abort () from /lib64/libc.so.6
#2 0x00002aaaabce75d6 in __assert_fail () from /lib64/libc.so.6
#3 0x00002aaaaaf734e1 in opal_evsignal_dealloc (base=0x609f50) at signal.c:295
#4 0x00002aaaaaf73f36 in poll_dealloc (base=0x609f50, arg=0x60a9a0) at poll.c:390
#5 0x00002aaaaaf70667 in opal_event_base_free (base=0x609f50) at event.c:530
#6 0x00002aaaaaf70519 in opal_event_fini () at event.c:390
#7 0x00002aaaaaf5f624 in opal_finalize () at runtime/opal_finalize.c:117
#8 0x00002aaaaacd4fc4 in orte_finalize () at runtime/orte_finalize.c:84
#9 0x000000000040196a in main (argc=1, argv=0x7fffffffdf38) at orte-ps.c:275
Any thoughts on why this is happening for only the tools case?
-- Josh
On Feb 14, 2009, at 4:51 PM, bosi...@osl.iu.edu wrote:
Author: bosilca
Date: 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
New Revision: 20562
URL: https://svn.open-mpi.org/trac/ompi/changeset/20562
Log:
Release the default base on finalize.
Text files modified:
trunk/opal/event/event.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
Modified: trunk/opal/event/event.c
===================================================================
--- trunk/opal/event/event.c (original)
+++ trunk/opal/event/event.c 2009-02-14 16:51:09 EST (Sat, 14 Feb 2009)
@@ -386,6 +386,10 @@
if (NULL != opal_event_module_include) {
opal_argv_free(opal_event_module_include);
}
+ if( NULL != opal_current_base ) {
+ event_base_free(opal_current_base);
+ opal_current_base = NULL;
+ }
return OPAL_SUCCESS;
}
_______________________________________________
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems