Thanks Gilles! The “clean” event doesn’t have to go last: any message that arrives after all recvs have been removed is simply dropped at termination. This commit only ensures that the list of posted recvs is destructed cleanly, which prevents the segfault.
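For anyone skimming the thread, the behavior I am describing is roughly the following. This is only a self-contained sketch: posted_recv_t, posted_recvs, deliver() and close_recvs() are made-up names, and a plain pthread mutex stands in for the serialization the real code gets by pushing these operations into events on the progress thread. It is not the ORTE code itself; it just shows that once the posted-recv list has been drained at close, a late delivery finds no match and the message is dropped instead of touching freed memory:

    /* Illustrative sketch only - simplified stand-ins for the rml structures,
     * not the actual ORTE code.  A mutex is used here purely so the example
     * is self-contained. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct posted_recv {
        int tag;
        struct posted_recv *next;
    } posted_recv_t;

    static posted_recv_t *posted_recvs = NULL;   /* stand-in for the posted-recv list */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* a message arrives for the given tag */
    static void deliver(int tag)
    {
        posted_recv_t *p;

        pthread_mutex_lock(&lock);
        for (p = posted_recvs; NULL != p; p = p->next) {
            if (tag == p->tag) {
                printf("matched posted recv for tag %d\n", tag);
                pthread_mutex_unlock(&lock);
                return;
            }
        }
        pthread_mutex_unlock(&lock);
        /* no posted recv left - nothing to do, the message is simply dropped */
        printf("dropped message for tag %d\n", tag);
    }

    /* finalize: cleanly destruct the list of posted recvs */
    static void close_recvs(void)
    {
        pthread_mutex_lock(&lock);
        while (NULL != posted_recvs) {
            posted_recv_t *item = posted_recvs;
            posted_recvs = item->next;
            free(item);
        }
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        posted_recv_t *r = malloc(sizeof(*r));
        r->tag = 42;
        r->next = NULL;
        posted_recvs = r;

        deliver(42);    /* matches the posted recv */
        close_recvs();  /* teardown drains the list */
        deliver(42);    /* arrives after teardown: dropped, no segfault */
        return 0;
    }

Built with cc -pthread, the last deliver() reports a dropped message rather than crashing, which is the outcome we want from the finalize path.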
> On Jul 22, 2015, at 1:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Thanks Ralph,
> 
> i was unable to reproduce any crash with this fix :-)
> 
> i checked the code that is invoked in the progress thread, and it might queue
> other events.
> bottom line, i am not 100% convinced the "clean" event will be executed at
> the very last.
> 
> that being said, and once again, i was unable to reproduce any crash.
> 
> Cheers,
> 
> Gilles
> 
> On 7/22/2015 12:48 AM, Ralph Castain wrote:
>> I believe I have this fixed - please see if this solves the problem:
>> 
>> https://github.com/open-mpi/ompi/pull/730
>> 
>>> On Jul 21, 2015, at 12:22 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>> 
>>> Ralph,
>>> 
>>> here is some more detailed information.
>>> 
>>> from orte_ess_base_app_finalize()
>>> first orte_rml_base_close() is invoked (via
>>> mca_base_framework_close(&orte_rml_base_framework)) and it does
>>> 
>>>     while (NULL != (item = opal_list_remove_first(&orte_rml_base.posted_recvs))) {
>>>         OBJ_RELEASE(item);
>>>     }
>>> 
>>> then, opal_stop_progress_thread() is invoked
>>> 
>>> that means that when orte_rml_base_close is invoked, the progress thread is
>>> up and running, and can potentially invoke orte_rml_base_post_recv that does
>>> 
>>>     if (req->cancel) {
>>>         OPAL_LIST_FOREACH(recv, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>>>             if (OPAL_EQUAL == orte_util_compare_name_fields(mask, &post->peer, &recv->peer) &&
>>>                 post->tag == recv->tag) {
>>>                 opal_output_verbose(5, orte_rml_base_framework.framework_output,
>>>                                     "%s canceling recv %d for peer %s",
>>>                                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>>                                     post->tag, ORTE_NAME_PRINT(&recv->peer));
>>>                 /* got a match - remove it */
>>>                 opal_list_remove_item(&orte_rml_base.posted_recvs, &recv->super);
>>>                 OBJ_RELEASE(recv);
>>>                 break;
>>>             }
>>>         }
>>>         OBJ_RELEASE(req);
>>>         return;
>>>     }
>>> 
>>> /* this is where the assert fails */
>>> 
>>> since there is no lock, there is room for a race condition.
>>> 
>>> before all that happen, and in orte_ess_base_app_finalize(),
>>> mca_base_framework_close(&orte_grpcomm_base_framework) invokes finalize from grpcomm_rcd.c
>>> that does orte_rml.recv_cancel(ORTE_NAME_WILDCARD, ORTE_RML_TAG_ALLGATHER_RCD) that is
>>> orte_rml_oob_recv_cancel that ends up doing opal_event_set(..., orte_rml_base_post_recv)
>>> 
>>> my first and naive attempt was to stop the opal_async progress thread
>>> before closing the rml_base framework:
>>> 
>>> diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c
>>> index d3bc6e6..4e09b47 100644
>>> --- a/orte/mca/ess/base/ess_base_std_app.c
>>> +++ b/orte/mca/ess/base/ess_base_std_app.c
>>> @@ -353,18 +353,18 @@ int orte_ess_base_app_finalize(void)
>>>      (void) mca_base_framework_close(&orte_dfs_base_framework);
>>>      (void) mca_base_framework_close(&orte_routed_base_framework);
>>>
>>> -    (void) mca_base_framework_close(&orte_rml_base_framework);
>>> -    (void) mca_base_framework_close(&orte_oob_base_framework);
>>> -    (void) mca_base_framework_close(&orte_state_base_framework);
>>> -
>>> -    orte_session_dir_finalize(ORTE_PROC_MY_NAME);
>>> -
>>>      /* release the event base */
>>>      if (progress_thread_running) {
>>>          opal_stop_progress_thread("opal_async", true);
>>>          progress_thread_running = false;
>>>      }
>>>
>>> +    (void) mca_base_framework_close(&orte_rml_base_framework);
>>> +    (void) mca_base_framework_close(&orte_oob_base_framework);
>>> +    (void) mca_base_framework_close(&orte_state_base_framework);
>>> +
>>> +    orte_session_dir_finalize(ORTE_PROC_MY_NAME);
>>> +
>>>      return ORTE_SUCCESS;
>>>  }
>>> 
>>> that did not work : the opal_async progress thread is also used by pmix, so at this stage,
>>> invoking opal_stop_progress_thread only decrements the refcount (e.g. no pthread_join() )
>>> 
>>> my second and dumb attempt was to finalize pmix before ess_base_app, and that did not work
>>> (crash)
>>> 
>>> i ran out of idea on how to fix this issue, but i found a simple workaround :
>>> adding a short sleep (10 ms) in orte_rml_base_close() seems enough to avoid the race condition.
>>> 
>>> diff --git a/orte/mca/rml/base/rml_base_frame.c b/orte/mca/rml/base/rml_base_frame.c
>>> index 54dc454..050154c 100644
>>> --- a/orte/mca/rml/base/rml_base_frame.c
>>> +++ b/orte/mca/rml/base/rml_base_frame.c
>>> @@ -17,6 +17,7 @@
>>>
>>>  #include "orte_config.h"
>>>
>>> +#include <sys/poll.h>
>>>  #include <string.h>
>>>
>>>  #include "opal/dss/dss.h"
>>> @@ -78,6 +79,7 @@ static int orte_rml_base_close(void)
>>>  {
>>>      opal_list_item_t *item;
>>>
>>> +    poll(NULL,0,10);
>>>      while (NULL != (item = opal_list_remove_first(&orte_rml_base.posted_recvs))) {
>>>          OBJ_RELEASE(item);
>>>      }
>>> 
>>> incidentally, i found two OPAL_LIST_FOREACH "loops" in which opal_list_remove_item is invoked.
>>> per a comment in opal_list.h, this is unsafe and OPAL_LIST_FOREACH_SAFE should be used :
>>> 
>>> diff --git a/orte/mca/rml/base/rml_base_msg_handlers.c b/orte/mca/rml/base/rml_base_msg_handlers.c
>>> index 758bf91..22c7601 100644
>>> --- a/orte/mca/rml/base/rml_base_msg_handlers.c
>>> +++ b/orte/mca/rml/base/rml_base_msg_handlers.c
>>> @@ -12,7 +12,9 @@
>>>   *                         All rights reserved.
>>>   * Copyright (c) 2007-2013 Los Alamos National Security, LLC. All rights
>>>   *                         reserved.
>>> - * Copyright (c) 2015      Intel, Inc. All rights reserved.
>>> + * Copyright (c) 2015      Intel, Inc. All rights reserved.
>>> + * Copyright (c) 2015      Research Organization for Information Science
>>> + *                         and Technology (RIST). All rights reserved.
>>>   * $COPYRIGHT$
>>>   *
>>>   * Additional copyrights may follow
>>> @@ -55,7 +57,7 @@ static void msg_match_recv(orte_rml_posted_recv_t *rcv, bool get_all);
>>>  void orte_rml_base_post_recv(int sd, short args, void *cbdata)
>>>  {
>>>      orte_rml_recv_request_t *req = (orte_rml_recv_request_t*)cbdata;
>>> -    orte_rml_posted_recv_t *post, *recv;
>>> +    orte_rml_posted_recv_t *post, *recv, *next;
>>>      orte_ns_cmp_bitmask_t mask = ORTE_NS_CMP_ALL | ORTE_NS_CMP_WILD;
>>>
>>>      opal_output_verbose(5, orte_rml_base_framework.framework_output,
>>> @@ -75,7 +77,7 @@ void orte_rml_base_post_recv(int sd, short args, void *cbdata)
>>>       * and remove it from our list
>>>       */
>>>      if (req->cancel) {
>>> -        OPAL_LIST_FOREACH(recv, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>>> +        OPAL_LIST_FOREACH_SAFE(recv, next, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>>>              if (OPAL_EQUAL == orte_util_compare_name_fields(mask, &post->peer, &recv->peer) &&
>>>                  post->tag == recv->tag) {
>>>                  opal_output_verbose(5, orte_rml_base_framework.framework_output,
>>> @@ -120,12 +122,12 @@ void orte_rml_base_post_recv(int sd, short args, void *cbdata)
>>>
>>>  void orte_rml_base_complete_recv_msg (orte_rml_recv_t **recv_msg)
>>>  {
>>> -    orte_rml_posted_recv_t *post;
>>> +    orte_rml_posted_recv_t *post, *next;
>>>      orte_ns_cmp_bitmask_t mask = ORTE_NS_CMP_ALL | ORTE_NS_CMP_WILD;
>>>      opal_buffer_t buf;
>>>      orte_rml_recv_t *msg = *recv_msg;
>>>      /* see if we have a waiting recv for this message */
>>> -    OPAL_LIST_FOREACH(post, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>>> +    OPAL_LIST_FOREACH_SAFE(post, next, &orte_rml_base.posted_recvs, orte_rml_posted_recv_t) {
>>>          /* since names could include wildcards, must use
>>>           * the more generalized comparison function
>>>           */
>>> 
>>> i hope this helps,
>>> 
>>> Gilles
>>> 
>>> On 7/17/2015 11:04 PM, Ralph Castain wrote:
>>>> It’s probably a race condition caused by uniting the ORTE and OPAL async
>>>> threads, though I can’t confirm that yet.
>>>> 
>>>>> On Jul 17, 2015, at 3:11 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>> 
>>>>> Folks,
>>>>> 
>>>>> I noticed several errors such as http://mtt.open-mpi.org/index.php?do_redir=2244
>>>>> that did not make any sense to me (at first glance)
>>>>> 
>>>>> I was able to attach one process when the issue occurs.
>>>>> the sigsegv occurs in thread 2, while thread 1 is invoking ompi_rte_finalize.
>>>>> 
>>>>> All I can think is a scenario in which the progress thread (aka thread 2)
>>>>> is still dealing with some memory that was just freed/unmapped/corrupted
>>>>> by the main thread.
>>>>> 
>>>>> I empirically noticed the error is more likely to occur when there are
>>>>> many tasks on one node
>>>>> e.g.
>>>>> mpirun --oversubscribe -np 32 a.out
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
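One more note on the OPAL_LIST_FOREACH_SAFE change Gilles suggests above, since it is easy to miss why it matters. Below is a self-contained sketch with simplified macros that only mirror the opal_list semantics (they are not the real opal_list.h definitions): the plain FOREACH computes the next element from the current one after the body has run, so freeing the current element and letting the loop continue is a use-after-free, while the SAFE variant caches the next pointer before the body runs, which makes removal inside the loop legal.

    /* Sketch only: LIST_FOREACH / LIST_FOREACH_SAFE are simplified stand-ins
     * for the opal_list macros, using a plain singly-linked list. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct item { int tag; struct item *next; } item_t;

    /* unsafe for removal: the increment dereferences 'it' after the body,
     * so freeing 'it' in the body leaves a dangling access */
    #define LIST_FOREACH(it, head) \
        for ((it) = (head); NULL != (it); (it) = (it)->next)

    /* safe for removal: the next pointer is cached before the body runs */
    #define LIST_FOREACH_SAFE(it, nxt, head) \
        for ((it) = (head), (nxt) = (NULL != (it)) ? (it)->next : NULL; \
             NULL != (it); \
             (it) = (nxt), (nxt) = (NULL != (it)) ? (it)->next : NULL)

    int main(void)
    {
        item_t *head = NULL, *it, *nxt, **prevp;
        int i;

        for (i = 0; i < 3; i++) {          /* build a list with tags 2, 1, 0 */
            item_t *n = malloc(sizeof(*n));
            n->tag = i;
            n->next = head;
            head = n;
        }

        /* remove (and free) the element with tag 1 while iterating */
        prevp = &head;
        LIST_FOREACH_SAFE(it, nxt, head) {
            if (1 == it->tag) {
                *prevp = it->next;         /* unlink */
                free(it);                  /* safe: the loop already cached nxt */
                continue;
            }
            prevp = &it->next;
        }

        LIST_FOREACH(it, head) {           /* plain FOREACH is fine when nothing is removed */
            printf("remaining tag %d\n", it->tag);
        }
        return 0;
    }

That is the general hazard the opal_list.h comment warns about; whether the existing rml loops can actually hit it depends on whether they keep iterating after the removal, but switching them to the SAFE form costs nothing and removes the doubt.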