Folks,

several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a
similar stack trace.

For example, you can refer to :
http://mtt.open-mpi.org/index.php?do_redir=2199

the issue is not related whatsoever to the init_thread_serialized test
(other tests failed with similar symptoms)

so far i could find that :
- the issue is intermittent and can be hard to reproduce (1 failure over
1000 runs)
- per the mtt logs, it seems this is quite a recent failure
- a necessary condition is that MPI tasks exit with a non-zero status after
having called MPI_Finalize()
- the crash occurs in orte/mca/oob/base/oob_base_frame.c at line 89 when
invoking
OBJ_RELEASE(value) ;
in some rare cases, value is NULL, which causes the crash.
- though i cannot incriminate one changeset in particular, i highly suspect
the changes that were made in order to address the issue(s) discussed at
http://www.open-mpi.org/community/lists/devel/2014/05/14908.php

the attached patch works around this issue.
i did not commit it because i consider it a workaround and not a
fix :
the root cause might be a tricky race condition ("abort" after
MPI_Finalize).


as a side note, here is the definition of OBJ_RELEASE
(opal/class/opal_object.h)
#if OPAL_ENABLE_DEBUG
#define OBJ_RELEASE(object)                                             \
    do {                                                                \
        assert(NULL != ((opal_object_t *) (object))->obj_class);        \
        assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
    } while (0)
...
#else
...

should we add the following assert at the beginning ?
assert(NULL != object);
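
for illustration only, here is a minimal standalone sketch of both ideas: the
NULL guard used in the workaround and an assert on NULL at the top of a release
macro. toy_object_t, TOY_RELEASE and release_all are hypothetical names made up
for this example; this is not the OPAL code, and the reference counting is
deliberately simplified (no atomics, no destructors).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted object (not opal_object_t). */
typedef struct toy_object {
    int refcount;
} toy_object_t;

/* Hypothetical release macro: with assertions enabled, a NULL argument is
 * reported by the assert instead of crashing on the dereference below. */
#define TOY_RELEASE(object)                                 \
    do {                                                    \
        assert(NULL != (object));                           \
        if (0 == --(object)->refcount) {                    \
            free(object);                                   \
            (object) = NULL;                                \
        }                                                   \
    } while (0)

/* Walk an array of object pointers and release only the non-NULL ones,
 * mirroring the guard in the workaround. Returns the number of entries
 * on which a release was performed. */
static int release_all(toy_object_t **values, size_t count)
{
    int released = 0;
    for (size_t i = 0; i < count; i++) {
        if (NULL != values[i]) {
            TOY_RELEASE(values[i]);
            released++;
        }
    }
    return released;
}
```

with the guard in place the assert never fires in this loop; it would only
catch other callers that pass NULL, which is the point of the question above.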


Thanks in advance for your comments,

Gilles
Index: orte/mca/oob/base/oob_base_frame.c
===================================================================
--- orte/mca/oob/base/oob_base_frame.c	(revision 31967)
+++ orte/mca/oob/base/oob_base_frame.c	(working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2007      Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2013-2014 Los Alamos National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -86,7 +88,11 @@
     rc = opal_hash_table_get_first_key_uint64 (&orte_oob_base.peers, &key,
                                                (void **) &value, &node);
     while (OPAL_SUCCESS == rc) {
-        OBJ_RELEASE(value);
+        /* in some rare cases, value can be NULL.
+           this would cause a crash in OBJ_RELEASE */
+        if (NULL != value) {
+            OBJ_RELEASE(value);
+        }
         rc = opal_hash_table_get_next_key_uint64 (&orte_oob_base.peers, &key,
                                                   (void **) &value, node, &node);
     }
