FYI: this has been fixed and the temporary patch removed. Turned out to be a 
problem with progress threads not getting completely cleaned up prior to exit, 
resulting in multiple threads executing opal_finalize.
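
For anyone wondering how that kind of race bites, here is a minimal sketch in C. It is not the actual change committed to the trunk, and every name in it (demo_init, demo_shutdown, demo_finalize, progress_loop, env_var_name) is a hypothetical stand-in rather than an Open MPI symbol. It assumes POSIX threads and C11 atomics, and shows two defenses: join the progress thread before tearing anything down, and guard the teardown so its body runs only once no matter how many threads call it.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names come from the Open MPI tree. */
static atomic_bool stop_requested = false;  /* asks the progress thread to exit */
static atomic_int finalize_calls = 0;       /* times the teardown body actually ran */
static char *env_var_name = NULL;           /* plays the role of a heap-owned field */
static pthread_t progress_tid;
static pthread_once_t finalize_once = PTHREAD_ONCE_INIT;

/* Teardown body: frees shared state. Running it twice would free twice. */
static void demo_finalize_body(void)
{
    free(env_var_name);
    env_var_name = NULL;
    atomic_fetch_add(&finalize_calls, 1);
}

/* Any thread may call this; pthread_once runs the body a single time. */
static void demo_finalize(void)
{
    pthread_once(&finalize_once, demo_finalize_body);
}

/* Progress thread: idles until told to stop, then attempts its own teardown. */
static void *progress_loop(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_requested)) {
        sched_yield();
    }
    demo_finalize();
    return NULL;
}

/* Allocate the shared field and start the progress thread. Returns 0 on success. */
static int demo_init(void)
{
    const char *name = "DEMO_ENV_VAR";
    env_var_name = malloc(strlen(name) + 1);
    if (env_var_name == NULL) {
        return -1;
    }
    strcpy(env_var_name, name);
    return pthread_create(&progress_tid, NULL, progress_loop, NULL);
}

/* Stop and join the progress thread first, then finalize from this thread.
 * Returns how many times the teardown body ran. */
static int demo_shutdown(void)
{
    atomic_store(&stop_requested, true);
    int rc = pthread_join(progress_tid, NULL);
    assert(rc == 0);
    demo_finalize();
    return atomic_load(&finalize_calls);
}
```

Both the progress thread and the main thread call demo_finalize() here, yet the free() runs once. Without the join and the once-guard, the two callers could both reach free() on the same pointer, which is the sort of heap corruption that shows up as a crash in _int_free.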


On Dec 24, 2012, at 10:43 AM, Ralph Castain <r...@open-mpi.org> wrote:

> FWIW: I have installed a temporary patch that allows the trunk to run by no 
> longer finalizing OPAL. Once the param system has been repaired, this will be 
> removed. Meantime, at least you can run the trunk.
> 
> On Dec 24, 2012, at 10:39 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> Hi folks
>> 
>> This is a heads-up to all: It appears a recent commit has broken the trunk - 
>> I think it relates to something done to the MCA parameter system. When 
>> running across multiple nodes, the daemons segfault on finalize with a 
>> stacktrace of:
>> 
>> (gdb) where
>> #0  0x0000003dc4477e92 in _int_free () from /lib64/libc.so.6
>> #1  0x00007f18a163f756 in param_destructor (p=0x118d940) at mca_base_param.c:1982
>> #2  0x00007f18a163ab41 in opal_obj_run_destructors (object=0x118d940) at ../../../opal/class/opal_object.h:448
>> #3  0x00007f18a163cb94 in mca_base_param_finalize () at mca_base_param.c:853
>> #4  0x00007f18a1609c06 in opal_finalize_util () at runtime/opal_finalize.c:69
>> #5  0x00007f18a1609cbc in opal_finalize () at runtime/opal_finalize.c:155
>> #6  0x00007f18a18e366b in orte_finalize () at runtime/orte_finalize.c:107
>> #7  0x00007f18a1911313 in orte_daemon (argc=35, argv=0x7ffffd7ea8b8) at orted/orted_main.c:834
>> #8  0x000000000040091a in main (argc=35, argv=0x7ffffd7ea8b8) at orted.c:62
>> (gdb) up
>> #1  0x00007f18a163f756 in param_destructor (p=0x118d940) at mca_base_param.c:1982
>> 1982         free(p->mbp_env_var_name);
>> 
>> (gdb) print array[i]
>> $2 = {mbp_super = {obj_magic_id = 0, obj_class = 0x7f18a18c6460, obj_reference_count = 1,
>>     cls_init_file_name = 0x7f18a169d04e "mca_base_param.c", cls_init_lineno = 1154},
>>   mbp_type = MCA_BASE_PARAM_TYPE_STRING, mbp_type_name = 0x1185110 "\300O\030\001",
>>   mbp_component_name = 0x0, mbp_param_name = 0x1185130 "",
>>   mbp_full_name = 0x1185150 "orte_debugger_test_daemon", mbp_synonyms = 0x0,
>>   mbp_internal = false, mbp_read_only = false, mbp_deprecated = false,
>>   mbp_deprecated_warning_shown = true,
>>   mbp_help_msg = 0x11850a0 "Name of the executable to be used to simulate a debugger colaunch (relative or absolute path)",
>>   mbp_env_var_name = 0x1185180 "\020P\030\001",
>>   mbp_default_value = {intval = 0, stringval = 0x0}, mbp_file_value_set = false,
>>   mbp_file_value = {intval = 0, stringval = 0x0}, mbp_source_file = 0x0,
>>   mbp_override_value_set = false, mbp_override_value = {intval = 0, stringval = 0x0}}
>> 
>> As you can see, the problem is that the mbp_env_var_name field is trash, so 
>> the destructor's attempt to free that field crashes.
>> 
>> I believe it was Nathan that last touched this area, so perhaps he could 
>> take a gander and see what happened? Meantime, I'm afraid the trunk is down.
>> 
>> Thanks
>> Ralph
>> 
> 

