Rats - sent too soon. Should have noted that I committed the fix and CMR'd it to 1.8.2

On Jun 9, 2014, at 10:47 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I'm not sure that was created by the commit you cite, but it may have been
> exposed by it. Either way, the patch is correct - the TCP component will NULL
> the entry in the hash table, but that doesn't remove the key and so the
> hash_table lookup request will return "success" with a NULL pointer.
>
>
> On Jun 8, 2014, at 10:24 PM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
>> Folks,
>>
>> several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a
>> similar stack trace.
>>
>> For example, you can refer to :
>> http://mtt.open-mpi.org/index.php?do_redir=2199
>>
>> the issue is not related whatsoever to the init_thread_serialized test
>> (other tests failed with similar symptoms)
>>
>> so far i could find that :
>> - the issue is intermittent and can be hard to reproduce (1 failure over
>>   1000 runs)
>> - per the mtt logs, it seems this is quite a recent failure
>> - a necessary condition is that MPI tasks exit with a non zero status after
>>   having called MPI_Finalize()
>> - the crash occurs in orte/mca/oob/base/oob_base_frame.c at line 89 when
>>   invoking
>>       OBJ_RELEASE(value) ;
>>   in some rare cases, value is NULL which causes the crash.
>> - though i cannot incriminate one changeset in particular, i highly suspect
>>   the changes that were made in order to address the issue(s) discussed at
>>   http://www.open-mpi.org/community/lists/devel/2014/05/14908.php
>>
>> the attached patch works around this issue.
>> i did not commit it because i consider this a workaround and not a
>> fix :
>> the root cause might be a tricky race condition ("abort" after MPI_Finalize).
>>
>>
>> as a side note, here is the definition of OBJ_RELEASE
>> (opal/class/opal_object.h)
>> #if OPAL_ENABLE_DEBUG
>> #define OBJ_RELEASE(object) \
>> do { \
>>     assert(NULL != ((opal_object_t *) (object))->obj_class); \
>>     assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
>> } while (0)
>> ...
>> #else
>> ...
>>
>> should we add the following assert at the beginning ?
>> assert(NULL != object);
>>
>>
>> Thanks in advance for your comments,
>>
>> Gilles
>> <oob.patch>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/14994.php
>
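
For anyone following along, the failure mode described in the quoted thread (a key that stays in the table after its entry is set to NULL, so the lookup reports success but hands back a NULL pointer) can be sketched with a toy table. This is only an illustration under stated assumptions: it is not the OPAL hash table, not the committed patch, and every name in it (toy_table_t, toy_set, toy_get, toy_release, toy_cleanup_entry) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy types; these are NOT the OPAL classes. */
#define TOY_SLOTS     8
#define TOY_SUCCESS   0
#define TOY_ERROR   (-1)

typedef struct {
    int refcount;
} toy_object_t;

typedef struct {
    uint64_t      key;
    int           in_use;  /* nonzero while the key is present */
    toy_object_t *value;   /* may be NULL even though the key is present */
} toy_slot_t;

typedef struct {
    toy_slot_t slots[TOY_SLOTS];
} toy_table_t;

static void toy_table_init(toy_table_t *t)
{
    memset(t, 0, sizeof(*t));
}

/* Store value under key. Storing NULL overwrites the entry but keeps the key. */
static int toy_set(toy_table_t *t, uint64_t key, toy_object_t *value)
{
    int free_slot = -1;
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (t->slots[i].in_use && t->slots[i].key == key) {
            t->slots[i].value = value;
            return TOY_SUCCESS;
        }
        if (!t->slots[i].in_use && free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0) {
        return TOY_ERROR;  /* table full */
    }
    t->slots[free_slot].key = key;
    t->slots[free_slot].in_use = 1;
    t->slots[free_slot].value = value;
    return TOY_SUCCESS;
}

/* Look up key. Success only means the key exists, not that *value is non-NULL. */
static int toy_get(const toy_table_t *t, uint64_t key, toy_object_t **value)
{
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (t->slots[i].in_use && t->slots[i].key == key) {
            *value = t->slots[i].value;
            return TOY_SUCCESS;
        }
    }
    return TOY_ERROR;
}

/* Drop one reference; returns 1 if the object was freed, 0 otherwise. */
static int toy_release(toy_object_t *obj)
{
    assert(NULL != obj);  /* the kind of early check suggested in the thread */
    if (0 == --obj->refcount) {
        free(obj);
        return 1;
    }
    return 0;
}

/* Guarded cleanup: test the pointer as well as the lookup's return code. */
static int toy_cleanup_entry(toy_table_t *t, uint64_t key)
{
    toy_object_t *value = NULL;
    if (TOY_SUCCESS != toy_get(t, key, &value)) {
        return 0;  /* key not present */
    }
    if (NULL == value) {
        return 0;  /* key present, but the entry was already NULLed */
    }
    toy_set(t, key, NULL);
    return toy_release(value);
}
```

The point of the sketch is the second check in toy_cleanup_entry: without it, a NULLed entry would reach toy_release and either trip the assert (debug build) or dereference NULL (with NDEBUG defined).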