Rats - sent too soon. Should have noted that I committed the fix and CMR'd it 
to 1.8.2

On Jun 9, 2014, at 10:47 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I'm not sure that was created by the commit you cite, but it may have been 
> exposed by it. Either way, the patch is correct - the TCP component will NULL 
> the entry in the hash table, but that doesn't remove the key and so the 
> hash_table lookup request will return "success" with a NULL pointer.
> 
> 
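The behavior described above (an entry whose value is set to NULL still has its key, so a lookup reports success and hands back a NULL pointer) can be sketched with a toy table. This is only an illustration under stated assumptions: toy_table_t, toy_set, toy_get, toy_remove and toy_has_releasable are hypothetical names invented for this sketch, not the OPAL hash table API and not the code in oob_base_frame.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy table for illustration only; NOT the OPAL hash table. */
#define TOY_SLOTS   8
#define TOY_SUCCESS 0
#define TOY_ERROR   (-1)

typedef struct { int used; uint64_t key; void *value; } toy_slot_t;
typedef struct { toy_slot_t slots[TOY_SLOTS]; } toy_table_t;

/* Insert or overwrite. Storing NULL keeps the key in the table. */
static int toy_set(toy_table_t *t, uint64_t key, void *value)
{
    toy_slot_t *free_slot = NULL;
    assert(NULL != t);
    for (int i = 0; i < TOY_SLOTS; i++) {
        toy_slot_t *s = &t->slots[i];
        if (s->used && s->key == key) {
            s->value = value;
            return TOY_SUCCESS;
        }
        if (!s->used && NULL == free_slot) {
            free_slot = s;
        }
    }
    if (NULL == free_slot) {
        return TOY_ERROR;               /* table is full */
    }
    free_slot->used = 1;
    free_slot->key = key;
    free_slot->value = value;
    return TOY_SUCCESS;
}

/* Lookup succeeds whenever the key exists, even if its value is NULL. */
static int toy_get(const toy_table_t *t, uint64_t key, void **value)
{
    assert(NULL != t && NULL != value);
    for (int i = 0; i < TOY_SLOTS; i++) {
        const toy_slot_t *s = &t->slots[i];
        if (s->used && s->key == key) {
            *value = s->value;
            return TOY_SUCCESS;
        }
    }
    *value = NULL;
    return TOY_ERROR;                   /* key not present */
}

/* Only removing the key makes later lookups fail. */
static int toy_remove(toy_table_t *t, uint64_t key)
{
    assert(NULL != t);
    for (int i = 0; i < TOY_SLOTS; i++) {
        toy_slot_t *s = &t->slots[i];
        if (s->used && s->key == key) {
            s->used = 0;
            s->value = NULL;
            return TOY_SUCCESS;
        }
    }
    return TOY_ERROR;
}

/* Caller-side guard: a successful lookup is not enough, the value
 * must also be non-NULL before it is safe to release it. */
static int toy_has_releasable(const toy_table_t *t, uint64_t key)
{
    void *value = NULL;
    return TOY_SUCCESS == toy_get(t, key, &value) && NULL != value;
}
```

In this sketch, calling toy_set(&t, key, NULL) leaves toy_get returning TOY_SUCCESS with a NULL value, while toy_remove makes the same lookup fail, which is why the caller has to test the returned pointer as well as the return code.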
> On Jun 8, 2014, at 10:24 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> Folks,
>> 
>> several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a 
>> similar stack trace.
>> 
>> For example, you can refer to :
>> http://mtt.open-mpi.org/index.php?do_redir=2199
>> 
>> the issue is not related whatsoever to the init_thread_serialized test
>> (other tests failed with similar symptoms)
>> 
>> so far i could find that :
>> - the issue is intermittent and can be hard to reproduce (1 failure over 
>> 1000 runs)
>> - per the mtt logs, it seems this is quite a recent failure
>> - a necessary condition is that MPI tasks exit with a non zero status after 
>> having called MPI_Finalize()
>> - the crash occurs in orte/mca/oob/base/oob_base_frame.c at line 89 when 
>> invoking
>> OBJ_RELEASE(value) ;
>> in some rare cases, value is NULL which causes the crash.
>> - though i cannot incriminate one changeset in particular, i highly suspect 
>> the changes that were made in order to address the issue(s) discussed at 
>> http://www.open-mpi.org/community/lists/devel/2014/05/14908.php
>> 
>> the attached patch works around this issue.
>> i did not commit it because i consider this as a work around and not as a 
>> fix :
>> the root cause might be a tricky race condition ("abort" after MPI_Finalize).
>> 
>> 
>> as a side note, here is the definition of OBJ_RELEASE 
>> (opal/class/opal_object.h)
>> #if OPAL_ENABLE_DEBUG
>> #define OBJ_RELEASE(object)                                             \
>>     do {                                                                \
>>         assert(NULL != ((opal_object_t *) (object))->obj_class);        \
>>         assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
>>     } while (0)
>> ...
>> #else
>> ...
>> 
>> should we add the following assert at the beginning ?
>> assert(NULL != object);
>> 
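The proposed assert can be sketched on a toy reference-counted type. This is a minimal illustration, not the real macro: toy_object_t, toy_new, TOY_RELEASE and TOY_RELEASE_IF_SET are hypothetical names, and resetting the caller's pointer to NULL after the free is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted object; not opal_object_t. */
typedef struct { int refcount; } toy_object_t;

/* Allocate an object holding a single reference (NULL on allocation failure). */
static toy_object_t *toy_new(void)
{
    toy_object_t *o = malloc(sizeof(*o));
    if (NULL != o) {
        o->refcount = 1;
    }
    return o;
}

/* Debug-style release: check the pointer itself before any dereference,
 * so a NULL argument fails on a readable assert instead of a SIGSEGV.
 * When the count reaches zero the object is freed and the caller's
 * pointer is reset to NULL. */
#define TOY_RELEASE(object)                                             \
    do {                                                                \
        assert(NULL != (object));                                       \
        if (0 == --(object)->refcount) {                                \
            free(object);                                               \
            (object) = NULL;                                            \
        }                                                               \
    } while (0)

/* Call-site guard for values that may legitimately be NULL. */
#define TOY_RELEASE_IF_SET(object)                                      \
    do {                                                                \
        if (NULL != (object)) {                                         \
            TOY_RELEASE(object);                                        \
        }                                                               \
    } while (0)
```

The assert only helps in debug builds. In a build with asserts compiled out, a NULL argument to TOY_RELEASE would still be dereferenced, so a call site that can see NULL needs a guard such as TOY_RELEASE_IF_SET.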
>> 
>> Thanks in advance for your comments,
>> 
>> Gilles
>> <oob.patch>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/06/14994.php
> 
