Hi Lennart,

Thanks for your proposal.

Still, there is race condition b/w mds thread and recover/client thread, I 
think.
1) Headless while recover thread 2 is running
2) Headless while the client thread is accessing/modifying to client list.

I would like to propose other one, using only one mutex cb_lock @lga_cb
to protect common `lga_cb` resource, including `state`, `client` list, etc. in 
three threads.

1) Client thread
- Lock mutex before getting the state
- Unlock right before the API operation is finished

2) Recovery2 thread
- Lock mutex after timeout
- Unlock before ending the thread

3) MDS thread
- Lock/unlock before/after changing state
In case of headless:
- Lock/unlock mutex before modifying `client_list` in lga_no_server_state_set()

What is your opinion? Do you think there is impact to log client if they call 
LOG API in multi threads? 

Regards, Vu.


>-----Original Message-----
>From: Lennart Lund [mailto:lennart.l...@ericsson.com]
>Sent: Monday, April 04, 2016 3:51 PM
>To: Vu Minh Nguyen
>Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net; Lennart
>Lund
>Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in
>log agent [#1705]
>
>Hi Vu,
>
>A simpler solution than described below, maybe:
>
>Still use a separate "recovery2" mutex but a "normal one".
> - In the API, lock the mutex just before checking if we are in recovery state 
> 2
>(before "if (lga_state == LGA_RECOVERY2) {")
> - In the API, do not unlock until the API operation is finished
> - In the recovery thread lock the mutex just before  changing the recovery
>state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2 mutex)
>(protect the lga_state variable)
>- In the recovery thread the mutex shall be unlocked just after lga_state is 
>set
>to recovery state 2.
>
>This means that the recovery thread cannot set recovery state2 or execute
>until an eventual ongoing API operation is done.
>If the thread starts and set recovery state just before the API checks recovery
>state the API will correctly see that we are in recovery state 2 and handle 
>thing
>accordingly.
>
>Regards
>Lennart
>
>
>> -----Original Message-----
>> From: Lennart Lund
>> Sent: den 1 april 2016 16:51
>> To: Vu Minh Nguyen
>> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net;
>> Lennart Lund
>> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource
>> in log agent [#1705]
>>
>> Hi Vu,
>>
>> You are right, good finding, this may happen since the recovery2 thread is
>> started and is waiting for a timeout. If this timeout happen at the "wrong
>> time" bad things may happen.
>> However the actual problem should be solved.
>>
>> This is how it could be done (maybe you can find a better way?):
>> When timeout the recovery thread shall set recovery state 2 as is but shall
>> not start any recovery if there is any ongoing client activity.
>> This could be solved by using a "recovery2" mutex (not the existing
>> lga_cb.cb_lock).
>>  - The recovery thread lock this mutex after setting recovery state 2.
>>  - During recovery state 1 all APIs check if the mutex is taken and if so 
>> return
>> TRY AGAIN (this means that we are in recovery state 2 but the state change
>> may have happened after the check in the API). Note: "trylock" cannot be
>> used for this purpose since a sequence trylock -> lock is not atomic. Maybe a
>> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used.
>>  - If the mutex is not taken it shall be locked by the API function and be
>> unlocked just before the API function returns preventing the recovery
>> thread from starting recovery during an ongoing API operation.
>> The next time an API is called recovery state 2 is set and the API function 
>> will
>> operate accordingly.
>>
>> Thanks
>> Lennart
>>
>> > -----Original Message-----
>> > From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> > Sent: den 1 april 2016 16:02
>> > To: Lennart Lund
>> > Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
>> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
>> resource
>> > in log agent [#1705]
>> >
>> > Hi Lennart,
>> >
>> > Is it possible for following case happens?
>> >
>> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG
>> > service up, recovery thread starts and
>> > comes to the point right after "while (pclient != NULL)" in
>> > recovery2_thread @lga_state.c
>> > Then, the context is switching to the client thread and it deletes the
>> > resource while recovery2_thread is running?
>> >
>> > Regards, Vu.
>> >
>> > Quoting Lennart Lund <lennart.l...@ericsson.com>:
>> >
>> > > Hi Vu,
>> > >
>> > >                  /* Recover clients one at a time */
>> > >                  rc = recover_one_client(p_client);
>> > >  +               if (p_client == NULL) goto done;
>> > >  +
>> > >  [Lennart] Why is this needed? This check is done in "while
>> > > (p_client != NULL) { "
>> > > and p_client cannot become NULL until "p_client = p_client->next; "
>> > > is done as
>> > > the last line in the while loop.
>> > >
>> > >  [Vu] It is my intention. I am afraid if the context is switched to
>> > >  client thread
>> > >  right after the recovery thread enters the while loop.
>> > >  If it is the case, p_client will be NULL after done 
>> > > recover_one_client().
>> > >
>> > > [Lennart] I don't really understand your explanation. What is
>> > > happening in the client thread when we are in recovery state 2 that
>> > > can set p_client == NULL? The client thread should not be able to do
>> > > anything with the data involved when recovering client in the
>> > > recovery thread?
>> > >
>> > > Regards
>> > > Lennart
>> > >
>> > >> -----Original Message-----
>> > >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> > >> Sent: den 1 april 2016 14:14
>> > >> To: Lennart Lund
>> > >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
>> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
>> > resource
>> > >> in log agent [#1705]
>> > >>
>> > >> Hi Lennart,
>> > >>
>> > >> Please see my comment inline.
>> > >>
>> > >> Regards, Vu
>> > >>
>> > >> Quoting Lennart Lund <lennart.l...@ericsson.com>:
>> > >>
>> > >> > Hi Vu
>> > >> >
>> > >> > Ack with comments
>> > >> >
>> > >> > See my comments inline [Lennart]:
>> > >> >
>> > >> > Thanks
>> > >> > Lennart
>> > >> >
>> > >> >> -----Original Message-----
>> > >> >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> > >> >> Sent: den 23 mars 2016 08:16
>> > >> >> To: mathi.naic...@oracle.com; Lennart Lund
>> > >> >> Cc: opensaf-devel@lists.sourceforge.net
>> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common
>> > resource
>> > >> in
>> > >> >> log agent [#1705]
>> > >> >>
>> > >> >>  osaf/libs/agents/saf/lga/lga_state.c |  41
>> > >> >> +++++++++++++++++++++++++++++++----
>> > >> >>  osaf/libs/agents/saf/lga/lga_state.h |   1 +
>> > >> >>  osaf/libs/agents/saf/lga/lga_util.c  |  30 +++++++++++--------------
>> > >> >>  3 files changed, 50 insertions(+), 22 deletions(-)
>> > >> >>
>> > >> >>
>> > >> >> There was an race condition between client thread and recovery
>> > thread
>> > >> >> accessing to common resource `client_list`.
>> > >> >>
>> > >> >> Add mutex protection.
>> > >> >>
>> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c
>> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c
>> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c
>> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c
>> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void *dumm
>> > >> >>                      /* We have been signaled to exit
>> > >> >> */
>> > >> >>                      goto done;
>> > >> >>              }
>> > >> >> +
>> > >> >>              /* Recover clients one at a time */
>> > >> >>              rc = recover_one_client(p_client);
>> > >> >> +            if (p_client == NULL) goto done;
>> > >> >> +
>> > >> > [Lennart] Why is this needed? This check is done in "while (p_client
>> > >> > != NULL) { " and p_client cannot become NULL until "p_client =
>> > >> > p_client->next; " is done as the last line in the while loop.
>> > >> >
>> > >> [Vu] It is my intention. I am afraid if the context is switched to
>> > >> client thread
>> > >> right after the recovery thread enters the while loop.
>> > >> If it is the case, p_client will be NULL after done 
>> > >> recover_one_client().
>> > >>
>> > >> >>              TRACE("\t Client %d is recovered", p_client-
>> > >> >> >lgs_client_id);
>> > >> >>              if (rc == -1) {
>> > >> >>                      TRACE("%s recover_one_client
>> > >> >> Fail Deleting cllient (id %d)",
>> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re
>> > >> >>
>> > >> >>      TRACE_ENTER();
>> > >> >>
>> > >> >> +    if (p_client == NULL) {
>> > >> >> +            /**
>> > >> >> +             * Probably the client_list has been deleted by
>> > >> >> client thread
>> > >> >> +             * in context calling saLogFinalize(). So, give up
>> > >> >> recovering.
>> > >> >> +             */
>> > >> >> +            rc = -1;
>> > >> >> +            goto done;
>> > >> >> +    }
>> > >> >> +
>> > >> >>      /* This function may be called both in the recovery thread and
>> > >> >> in the
>> > >> >>       * client thread. A possible scenario is that the recovery 2
>> > >> >> thread
>> > >> >>       * times out and calls this function in recovery mode 2 while
>> > >> >> there
>> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client(
>> > >> >>      lga_client_hdl_rec_t *rm_node
>> > >> >>      )
>> > >> >>  {
>> > >> >> -    TRACE_ENTER2();
>> > >> >> -    uint32_t ncs_rc;
>> > >> >> +    return lga_hdl_rec_del(list_head, rm_node);
>> > >> >> +}
>> > >> >>
>> > >> >> +/**
>> > >> >> + * Free the memory allocated for one client handle with mutext
>> > >> protection.
>> > >> >> + *
>> > >> >> + * @param p_client_hdl
>> > >> >> + * @return void
>> > >> >> + */
>> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl)
>> > >> >> +{
>> > >> >> +    lga_client_hdl_rec_t *client_hdl;
>> > >> >> +
>> > >> >> +    /**
>> > >> >> +     * To prevent race condition b/w client thread and recovery
>> > >> >> thread
>> > >> >> +     * accessing to common resource `client_list`. [#1705]
>> > >> >> +     */
>> > >> >>      osaf_mutex_lock_ordie(&lga_recov_mutex);
>> > >> >> -    ncs_rc = lga_hdl_rec_del(list_head, rm_node);
>> > >> >> +
>> > >> >> +    client_hdl = *p_client_hdl;
>> > >> >> +    if (client_hdl == NULL) goto done;
>> > >> >> +
>> > >> >> +    free(client_hdl);
>> > >> >> +    client_hdl = NULL;
>> > >> >> +
>> > >> >> +done:
>> > >> >>      osaf_mutex_unlock_ordie(&lga_recov_mutex);
>> > >> >>
>> > >> >> -    TRACE_LEAVE();
>> > >> >> -    return ncs_rc;
>> > >> >>  }
>> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h
>> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h
>> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h
>> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h
>> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client(
>> > >> >>      lga_client_hdl_rec_t **list_head,
>> > >> >>      lga_client_hdl_rec_t *rm_node
>> > >> >>      );
>> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl);
>> > >> >>
>> > >> >>  #ifdef      __cplusplus
>> > >> >>  }
>> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c
>> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c
>> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c
>> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c
>> > >> >> @@ -19,6 +19,7 @@
>> > >> >>  #include <syslog.h>
>> > >> >>  #include "lga.h"
>> > >> >>  #include "osaf_poll.h"
>> > >> >> +#include "lga_state.h"
>> > >> >>
>> > >> >>  /* Variables used during startup/shutdown only */
>> > >> >>  static pthread_mutex_t lga_lock = PTHREAD_MUTEX_INITIALIZER;
>> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec
>> > >> >>      /** clean up the channel records for this lga-client
>> > >> >>           **/
>> > >> >>              lga_log_stream_hdl_rec_list_del(&client_hdl-
>> > >> >> >stream_list);
>> > >> >> -    /** remove the association with hdl-mngr
>> > >> >> -         **/
>> > >> >> -            free(client_hdl);
>> > >> >> -            client_hdl = 0;
>> > >> >> +            free_client_hdl(&client_hdl);
>> > >> >>      }
>> > >> >> +
>> > >> >>      TRACE_LEAVE();
>> > >> >>  }
>> > >> >>
>> > >> >> @@ -536,38 +535,35 @@ uint32_t lga_hdl_rec_del(lga_client_hdl_
>> > >> >>      if (list_iter == rm_node) {
>> > >> >>              *list_head = rm_node->next;
>> > >> >>
>> > >> >> -    /** detach & release the IPC
>> > >> >> -         **/
>> > >> >> +            /* detach & release the IPC */
>> > >> >>              m_NCS_IPC_DETACH(&rm_node->mbx,
>> > >> >> lga_clear_mbx, NULL);
>> > >> >>              m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
>> > >> >>
>> > >> >>              ncshm_give_hdl(rm_node->local_hdl);
>> > >> >>              ncshm_destroy_hdl(NCS_SERVICE_ID_LGA,
>> > >> >> rm_node->local_hdl);
>> > >> >> -    /** Free the channel records off this hdl
>> > >> >> -         **/
>> > >> >> +
>> > >> >> +            /* Free the channel records off this hdl */
>> > >> >>              lga_log_stream_hdl_rec_list_del(&rm_node-
>> > >> >> >stream_list);
>> > >> >> +            free_client_hdl(&rm_node);
>> > >> >>
>> > >> >> -    /** free the hdl rec
>> > >> >> -         **/
>> > >> >> -            free(rm_node);
>> > >> >>              rc = NCSCC_RC_SUCCESS;
>> > >> >>              goto out;
>> > >> >> -    } else {                /* find the rec */
>> > >> >> +    } else {
>> > >> >> +            /* find the rec */
>> > >> >>              while (NULL != list_iter) {
>> > >> >>                      if (list_iter->next == rm_node) {
>> > >> >>                              list_iter->next =
>> > >> >> rm_node->next;
>> > >> >>
>> > >> >> -            /** detach & release the IPC */
>> > >> >> +                            /* detach & release
>> > >> >> the IPC */
>> > >> >>
>> > >> >>      m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx,
>> > >> NULL);
>> > >> >>
>> > >> >>      m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
>> > >> >>
>> > >> >>
>> > >> >>      ncshm_give_hdl(rm_node->local_hdl);
>> > >> >>
>> > >> >>      ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node-
>> > >> >> >local_hdl);
>> > >> >> -            /** Free the channel records off this lga_hdl  */
>> > >> >> +
>> > >> >> +                            /* Free the channel
>> > >> >> records off this lga_hdl  */
>> > >> >>
>> > >> >>      lga_log_stream_hdl_rec_list_del(&rm_node->stream_list);
>> > >> >> -
>> > >> >> -            /** free the hdl rec */
>> > >> >> -                            free(rm_node);
>> > >> >> +
>> > >> >>      free_client_hdl(&rm_node);
>> > >> >>
>> > >> >>                              rc =
>> > >> >> NCSCC_RC_SUCCESS;
>> > >> >>                              goto out;
>> > >>
>> >



------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to