Folks,

several MTT tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a similar stack trace.

For example, you can refer to:
http://mtt.open-mpi.org/index.php?do_redir=2199

The issue is not related in any way to the init_thread_serialized test (other tests failed with similar symptoms).

So far I could find that:
- the issue is intermittent and can be hard to reproduce (1 failure in 1000 runs)
- per the MTT logs, it seems this is quite a recent failure
- a necessary condition is that MPI tasks exit with a non-zero status after having called MPI_Finalize()
- the crash occurs in orte/mca/oob/base/oob_base_frame.c at line 89, when invoking OBJ_RELEASE(value); in some rare cases, value is NULL, which causes the crash
- though I cannot incriminate one changeset in particular, I highly suspect the changes that were made in order to address the issue(s) discussed at http://www.open-mpi.org/community/lists/devel/2014/05/14908.php

I attached a patch that works around this issue. I did not commit it because I consider it a workaround and not a fix: the root cause might be a tricky race condition ("abort" after MPI_Finalize).

As a side note, here is the definition of OBJ_RELEASE (opal/class/opal_object.h):

#if OPAL_ENABLE_DEBUG
#define OBJ_RELEASE(object) \
    do { \
        assert(NULL != ((opal_object_t *) (object))->obj_class); \
        assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
    } while (0)
...
#else
...

Should we add the following assert at the beginning?

        assert(NULL != object);

With it, a debug build would stop on a failed assertion instead of dereferencing a NULL pointer.

Thanks in advance for your comments,

Gilles
Index: orte/mca/oob/base/oob_base_frame.c
===================================================================
--- orte/mca/oob/base/oob_base_frame.c	(revision 31967)
+++ orte/mca/oob/base/oob_base_frame.c	(working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2007 Cisco Systems, Inc. All rights reserved.
  * Copyright (c) 2013-2014 Los Alamos National Security, LLC. All rights
  * reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -86,7 +88,11 @@
     rc = opal_hash_table_get_first_key_uint64 (&orte_oob_base.peers, &key,
                                                (void **) &value, &node);
     while (OPAL_SUCCESS == rc) {
-        OBJ_RELEASE(value);
+        /* in some rare cases, value can be NULL.
+           this would cause a crash in OBJ_RELEASE */
+        if (NULL != value) {
+            OBJ_RELEASE(value);
+        }
         rc = opal_hash_table_get_next_key_uint64 (&orte_oob_base.peers, &key,
                                                   (void **) &value, node, &node);
     }
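
For discussion, here is a small standalone sketch of the two ideas above: a release macro that asserts on NULL before touching the object, and a caller that skips NULL entries the way the workaround does. It does not use any OPAL header. demo_object_t, DEMO_RELEASE, demo_new, demo_release_all and demo_example are all hypothetical names made up for this illustration, and the macro is only a rough model of a reference-counted release, not the real OBJ_RELEASE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for opal_object_t: only a reference count. */
typedef struct {
    int refcount;
} demo_object_t;

/* Hypothetical release macro. The NULL check comes first, so a NULL
 * pointer fails a readable assertion instead of being dereferenced. */
#define DEMO_RELEASE(object)                                      \
    do {                                                          \
        assert(NULL != (object));                                 \
        if (0 == --((demo_object_t *) (object))->refcount) {      \
            free(object);                                         \
            (object) = NULL;                                      \
        }                                                         \
    } while (0)

/* Allocate an object holding a single reference. */
static demo_object_t *demo_new(void)
{
    demo_object_t *obj = malloc(sizeof(*obj));
    assert(NULL != obj);
    obj->refcount = 1;
    return obj;
}

/* Release every non-NULL entry, mirroring the guard in the patch.
 * Returns how many entries were skipped because they were NULL. */
static size_t demo_release_all(demo_object_t **values, size_t count)
{
    size_t skipped = 0;
    for (size_t i = 0; i < count; ++i) {
        if (NULL == values[i]) {
            ++skipped;
            continue;
        }
        DEMO_RELEASE(values[i]);
    }
    return skipped;
}

/* Three slots with the middle one NULL, as in the rare failing case. */
static size_t demo_example(void)
{
    demo_object_t *values[3];
    values[0] = demo_new();
    values[1] = NULL;
    values[2] = demo_new();
    return demo_release_all(values, 3);
}
```

Calling demo_example() releases the two real objects and reports one skipped entry. Without the guard in demo_release_all, the NULL slot would hit the assert in a build with assertions enabled, which is the behaviour the proposed assert would give us in OBJ_RELEASE.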