I'm not sure the bug was created by the commit you cite, but it may have been 
exposed by it. Either way, the patch is correct: the TCP component will NULL 
the entry in the hash table, but that doesn't remove the key, so the 
hash table lookup will return "success" with a NULL pointer.
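
To illustrate the failure mode, here is a minimal sketch in C. It is not the 
OPAL hash table or object code: toy_table_t, toy_set, toy_get, 
toy_release_entry and the TOY_* return codes are hypothetical names made up 
for this example. It only shows why a lookup can report success while handing 
back a NULL value, and why the caller has to test the value before releasing 
it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the OPAL API. */
#define TOY_SUCCESS    0
#define TOY_NOT_FOUND (-1)
#define TOY_FULL      (-2)
#define TOY_SLOTS      8

typedef struct {
    int refcount;
} toy_object_t;

typedef struct {
    int used[TOY_SLOTS];             /* 1 if the slot holds a key */
    unsigned keys[TOY_SLOTS];
    toy_object_t *values[TOY_SLOTS]; /* may be NULL for a present key */
} toy_table_t;

/* Store value under key. Storing NULL overwrites the value only;
 * the key itself stays in the table. */
static int toy_set(toy_table_t *t, unsigned key, toy_object_t *value)
{
    int i;
    for (i = 0; i < TOY_SLOTS; i++) {
        if (t->used[i] && t->keys[i] == key) {
            t->values[i] = value;
            return TOY_SUCCESS;
        }
    }
    for (i = 0; i < TOY_SLOTS; i++) {
        if (!t->used[i]) {
            t->used[i] = 1;
            t->keys[i] = key;
            t->values[i] = value;
            return TOY_SUCCESS;
        }
    }
    return TOY_FULL;
}

/* Look up key. Returns TOY_SUCCESS whenever the key is present,
 * even if the stored value is NULL. */
static int toy_get(const toy_table_t *t, unsigned key, toy_object_t **value)
{
    int i;
    for (i = 0; i < TOY_SLOTS; i++) {
        if (t->used[i] && t->keys[i] == key) {
            *value = t->values[i];
            return TOY_SUCCESS;
        }
    }
    return TOY_NOT_FOUND;
}

/* Drop one reference on the entry stored under key.
 * Returns 1 if a reference was dropped, 0 if there was nothing to release. */
static int toy_release_entry(toy_table_t *t, unsigned key)
{
    toy_object_t *value = NULL;

    if (TOY_SUCCESS != toy_get(t, key, &value)) {
        return 0;                    /* key not in the table */
    }
    if (NULL == value) {
        return 0;                    /* key present, but entry was NULLed */
    }
    value->refcount--;
    return 1;
}
```

Without the NULL == value test, toy_release_entry would dereference a NULL 
pointer as soon as an entry had been overwritten with NULL, which is the same 
shape of crash as the one reported at OBJ_RELEASE(value).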


On Jun 8, 2014, at 10:24 PM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

> Folks,
> 
> several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a 
> similar stack trace.
> 
> For example, you can refer to :
> http://mtt.open-mpi.org/index.php?do_redir=2199
> 
> the issue is not related whatsoever to the init_thread_serialized test
> (other tests failed with similar symptoms)
> 
> so far i could find that :
> - the issue is intermittent and can be hard to reproduce (1 failure over 1000 
> runs)
> - per the mtt logs, it seems this is quite a recent failure
> - a necessary condition is that MPI tasks exit with a non zero status after 
> having called MPI_Finalize()
> - the crash occurs in orte/mca/oob/base/oob_base_frame.c at line 89 when 
> invoking
> OBJ_RELEASE(value) ;
> in some rare cases, value is NULL which causes the crash.
> - though i cannot incriminate one changeset in particular, i highly suspect 
> the changes that were made in order to address the issue(s) discussed at 
> http://www.open-mpi.org/community/lists/devel/2014/05/14908.php
> 
> the attached patch works around this issue.
> i did not commit it because i consider it a workaround and not a fix:
> the root cause might be a tricky race condition ("abort" after MPI_Finalize).
> 
> 
> as a side note, here is the definition of OBJ_RELEASE 
> (opal/class/opal_object.h)
> #if OPAL_ENABLE_DEBUG
> #define OBJ_RELEASE(object)                                             \
>     do {                                                                \
>         assert(NULL != ((opal_object_t *) (object))->obj_class);        \
>         assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
>     } while (0)
> ...
> #else
> ...
> 
> should we add the following assert at the beginning ?
> assert(NULL != object);
> 
> 
> Thanks in advance for your comments,
> 
> Gilles
> <oob.patch>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14994.php
