Re: [devel] [PATCH 1 of 1] log: miss mutex protection for common resource in log agent [#1705]

Vu Minh Nguyen Fri, 08 Apr 2016 03:04:30 -0700

Hi all,

Have you had time to look at this? Thanks.


Regards, Vu.


>-----Original Message-----
>From: Vu Minh Nguyen [mailto:[email protected]]
>Sent: Wednesday, April 06, 2016 12:44 PM
>To: 'Lennart Lund'
>Cc: '[email protected]'; '[email protected]'
>Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in
>log agent [#1705]
>
>Hi Lennart,
>
>Thanks for your comments.
>
>I think it is good idea to pick the lgs_state & lga_state out of lga_cb,
>and use the mutex cb_lock to protect changes in lga control block.
>
>I reflect these things in attached file.
>
>And to make the code a bit clean, I have defined set/get functions for 
>lgs_state
>& lga_state.
>Also, I declare the mutex cb_lock as recursive one to avoid possible 
>recursively
>lock in one thread.
>
>Could you have a look and give your comments?
>
>Regards, Vu.
>
>>-----Original Message-----
>>From: Lennart Lund [mailto:[email protected]]
>>Sent: Tuesday, April 05, 2016 6:59 PM
>>To: Vu Minh Nguyen
>>Cc: [email protected]; [email protected]; Lennart
>>Lund
>>Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in
>>log agent [#1705]
>>
>>Hi Vu
>>
>>See my comments inline [Lennart]
>>
>>Note. My comments does not give the complete solution. For that I need to
>>use more time but may give you some ideas.
>>
>>/Lennart
>>
>>> -----Original Message-----
>>> From: Vu Minh Nguyen [mailto:[email protected]]
>>> Sent: den 5 april 2016 10:48
>>> To: Lennart Lund
>>> Cc: [email protected]; [email protected]
>>> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource
>>> in log agent [#1705]
>>>
>>> Hi Lennart,
>>>
>>> Thanks for your proposal.
>>>
>>> Still, there is race condition b/w mds thread and recover/client thread, I
>>> think.
>>> 1) Headless while recover thread 2 is running
>>> 2) Headless while the client thread is accessing/modifying to client list.
>>>
>>[Lennart] In the mds thread the lga_no_server_state_set() is called if server
>>down is detected. If recovery state is LGA_RECOVERY2 the thread is stopped
>>and recovery state is set to state LGA_NO_SERVER. This means that we start
>all
>>over again with recovery state LGA_RECOVERY1 when the server comes back
>>again.
>>Within the thread the cb lock is taken while the client list is deleted if
>needed.
>>With a new mutex for lga_state this mutex should also be used in all other
>>places where lga_state is handled not the cb_lock. This means that the name
>>"recovery2" mutex is not so good instead there should be a lga_state mutex.
>>The lga_state variable may also be moved outside the lga_cb structure.
>>
>>> I would like to propose other one, using only one mutex cb_lock @lga_cb
>>> to protect common `lga_cb` resource, including `state`, `client` list, etc. 
>>> in
>>> three threads.
>>>
>>[Lennart] The cb_lock is used to protect variables in the cb structure but the
>>lga_state mutex prevent recovery or API operations from being done at the
>>same time.
>>
>>> 1) Client thread
>>> - Lock mutex before getting the state
>>> - Unlock right before the API operation is finished
>>>
>>[Lennart] Yes but the lga_state mutex
>>
>>> 2) Recovery2 thread
>>> - Lock mutex after timeout
>>> - Unlock before ending the thread
>>>
>>[Lennart] No this will cause the API to be locked as long as the recovery 2
>>thread is running and this is not wanted. The API shall work also while the
>>recovery thread is running and most of the APIs shall return with
>>SA_AIS_TRY_AGAIN during recovery 2 state.
>>
>>> 3) MDS thread
>>> - Lock/unlock before/after changing state
>>> In case of headless:
>>> - Lock/unlock mutex before modifying `client_list` in
>>> lga_no_server_state_set()
>>>
>>[Lennart] lga_state is not changed directly in the mds thread but indirectly
>>when the lga_serv_recov1state_set() and lga_no_server_state_set() functions
>>are called. I these functions the recovery thread and state change is handled
>>and the mutex used shall be changed to the lga_state mutex.
>>
>>The client data should be protected by the cb_lock mutex in all places which
>is
>>not the case. There is no protection in the mds trhead in the
>>lga_lgs_msg_proc() where client data is used via the
>>ga_find_hdl_rec_by_regid() function. This function should lock the lga_cb lock
>>while handling the lga_cb client data.
>>Also there are two functions in lga_state.c that are handling client data
>locking
>>a lga_recov_mutex. This does not protect client data handled in the mds
>>thread.
>>
>>> What is your opinion? Do you think there is impact to log client if they 
>>> call
>>> LOG API in multi threads?
>>>
>>[Lennart] The mutex handling should be cleaned up.
>> - Recovery state should be protected by its own mutex so that APIs can work
>>in respective state without locking.
>> - Client data should be protected in all threads by the lga_cb lock.
>> - The lga_recov_mutex can probably be removed
>>
>>> Regards, Vu.
>>>
>>>
>>> >-----Original Message-----
>>> >From: Lennart Lund [mailto:[email protected]]
>>> >Sent: Monday, April 04, 2016 3:51 PM
>>> >To: Vu Minh Nguyen
>>> >Cc: [email protected]; [email protected];
>>> Lennart
>>> >Lund
>>> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common
>resource
>>> in
>>> >log agent [#1705]
>>> >
>>> >Hi Vu,
>>> >
>>> >A simpler solution than described below, maybe:
>>> >
>>> >Still use a separate "recovery2" mutex but a "normal one".
>>> > - In the API, lock the mutex just before checking if we are in recovery 
>>> > state
>>> 2
>>> >(before "if (lga_state == LGA_RECOVERY2) {")
>>> > - In the API, do not unlock until the API operation is finished
>>> > - In the recovery thread lock the mutex just before  changing the recovery
>>> >state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2
>>> mutex)
>>> >(protect the lga_state variable)
>>> >- In the recovery thread the mutex shall be unlocked just after lga_state 
>>> >is
>>> set
>>> >to recovery state 2.
>>> >
>>> >This means that the recovery thread cannot set recovery state2 or execute
>>> >until an eventual ongoing API operation is done.
>>> >If the thread starts and set recovery state just before the API checks
>>> recovery
>>> >state the API will correctly see that we are in recovery state 2 and handle
>>> thing
>>> >accordingly.
>>> >
>>> >Regards
>>> >Lennart
>>> >
>>> >
>>> >> -----Original Message-----
>>> >> From: Lennart Lund
>>> >> Sent: den 1 april 2016 16:51
>>> >> To: Vu Minh Nguyen
>>> >> Cc: [email protected]; [email protected];
>>> >> Lennart Lund
>>> >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common
>>> resource
>>> >> in log agent [#1705]
>>> >>
>>> >> Hi Vu,
>>> >>
>>> >> You are right, good finding, this may happen since the recovery2 thread
>is
>>> >> started and is waiting for a timeout. If this timeout happen at the 
>>> >> "wrong
>>> >> time" bad things may happen.
>>> >> However the actual problem should be solved.
>>> >>
>>> >> This is how it could be done (maybe you can find a better way?):
>>> >> When timeout the recovery thread shall set recovery state 2 as is but
>shall
>>> >> not start any recovery if there is any ongoing client activity.
>>> >> This could be solved by using a "recovery2" mutex (not the existing
>>> >> lga_cb.cb_lock).
>>> >>  - The recovery thread lock this mutex after setting recovery state 2.
>>> >>  - During recovery state 1 all APIs check if the mutex is taken and if so
>>> return
>>> >> TRY AGAIN (this means that we are in recovery state 2 but the state
>>> change
>>> >> may have happened after the check in the API). Note: "trylock" cannot be
>>> >> used for this purpose since a sequence trylock -> lock is not atomic.
>>> Maybe a
>>> >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used.
>>> >>  - If the mutex is not taken it shall be locked by the API function and 
>>> >> be
>>> >> unlocked just before the API function returns preventing the recovery
>>> >> thread from starting recovery during an ongoing API operation.
>>> >> The next time an API is called recovery state 2 is set and the API 
>>> >> function
>>> will
>>> >> operate accordingly.
>>> >>
>>> >> Thanks
>>> >> Lennart
>>> >>
>>> >> > -----Original Message-----
>>> >> > From: Vu Minh Nguyen [mailto:[email protected]]
>>> >> > Sent: den 1 april 2016 16:02
>>> >> > To: Lennart Lund
>>> >> > Cc: [email protected]; [email protected]
>>> >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
>>> >> resource
>>> >> > in log agent [#1705]
>>> >> >
>>> >> > Hi Lennart,
>>> >> >
>>> >> > Is it possible for following case happens?
>>> >> >
>>> >> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG
>>> >> > service up, recovery thread starts and
>>> >> > comes to the point right after "while (pclient != NULL)" in
>>> >> > recovery2_thread @lga_state.c
>>> >> > Then, the context is switching to the client thread and it deletes the
>>> >> > resource while recovery2_thread is running?
>>> >> >
>>> >> > Regards, Vu.
>>> >> >
>>> >> > Quoting Lennart Lund <[email protected]>:
>>> >> >
>>> >> > > Hi Vu,
>>> >> > >
>>> >> > >              /* Recover clients one at a time */
>>> >> > >              rc = recover_one_client(p_client);
>>> >> > >  +           if (p_client == NULL) goto done;
>>> >> > >  +
>>> >> > >  [Lennart] Why is this needed? This check is done in "while
>>> >> > > (p_client != NULL) { "
>>> >> > > and p_client cannot become NULL until "p_client = p_client->next; "
>>> >> > > is done as
>>> >> > > the last line in the while loop.
>>> >> > >
>>> >> > >  [Vu] It is my intention. I am afraid if the context is switched to
>>> >> > >  client thread
>>> >> > >  right after the recovery thread enters the while loop.
>>> >> > >  If it is the case, p_client will be NULL after done 
>>> >> > > recover_one_client().
>>> >> > >
>>> >> > > [Lennart] I don't really understand your explanation. What is
>>> >> > > happening in the client thread when we are in recovery state 2 that
>>> >> > > can set p_client == NULL? The client thread should not be able to do
>>> >> > > anything with the data involved when recovering client in the
>>> >> > > recovery thread?
>>> >> > >
>>> >> > > Regards
>>> >> > > Lennart
>>> >> > >
>>> >> > >> -----Original Message-----
>>> >> > >> From: Vu Minh Nguyen [mailto:[email protected]]
>>> >> > >> Sent: den 1 april 2016 14:14
>>> >> > >> To: Lennart Lund
>>> >> > >> Cc: [email protected]; [email protected]
>>> >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
>>> >> > resource
>>> >> > >> in log agent [#1705]
>>> >> > >>
>>> >> > >> Hi Lennart,
>>> >> > >>
>>> >> > >> Please see my comment inline.
>>> >> > >>
>>> >> > >> Regards, Vu
>>> >> > >>
>>> >> > >> Quoting Lennart Lund <[email protected]>:
>>> >> > >>
>>> >> > >> > Hi Vu
>>> >> > >> >
>>> >> > >> > Ack with comments
>>> >> > >> >
>>> >> > >> > See my comments inline [Lennart]:
>>> >> > >> >
>>> >> > >> > Thanks
>>> >> > >> > Lennart
>>> >> > >> >
>>> >> > >> >> -----Original Message-----
>>> >> > >> >> From: Vu Minh Nguyen [mailto:[email protected]]
>>> >> > >> >> Sent: den 23 mars 2016 08:16
>>> >> > >> >> To: [email protected]; Lennart Lund
>>> >> > >> >> Cc: [email protected]
>>> >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common
>>> >> > resource
>>> >> > >> in
>>> >> > >> >> log agent [#1705]
>>> >> > >> >>
>>> >> > >> >>  osaf/libs/agents/saf/lga/lga_state.c |  41
>>> >> > >> >> +++++++++++++++++++++++++++++++----
>>> >> > >> >>  osaf/libs/agents/saf/lga/lga_state.h |   1 +
>>> >> > >> >>  osaf/libs/agents/saf/lga/lga_util.c  |  30 
>>> >> > >> >> +++++++++++-------------
>>> -
>>> >> > >> >>  3 files changed, 50 insertions(+), 22 deletions(-)
>>> >> > >> >>
>>> >> > >> >>
>>> >> > >> >> There was an race condition between client thread and recovery
>>> >> > thread
>>> >> > >> >> accessing to common resource `client_list`.
>>> >> > >> >>
>>> >> > >> >> Add mutex protection.
>>> >> > >> >>
>>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c
>>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c
>>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c
>>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c
>>> >> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void
>>> *dumm
>>> >> > >> >>                          /* We have been signaled to exit
>>> >> > >> >> */
>>> >> > >> >>                          goto done;
>>> >> > >> >>                  }
>>> >> > >> >> +
>>> >> > >> >>                  /* Recover clients one at a time */
>>> >> > >> >>                  rc = recover_one_client(p_client);
>>> >> > >> >> +                if (p_client == NULL) goto done;
>>> >> > >> >> +
>>> >> > >> > [Lennart] Why is this needed? This check is done in "while
>(p_client
>>> >> > >> > != NULL) { " and p_client cannot become NULL until "p_client =
>>> >> > >> > p_client->next; " is done as the last line in the while loop.
>>> >> > >> >
>>> >> > >> [Vu] It is my intention. I am afraid if the context is switched to
>>> >> > >> client thread
>>> >> > >> right after the recovery thread enters the while loop.
>>> >> > >> If it is the case, p_client will be NULL after done
>>> recover_one_client().
>>> >> > >>
>>> >> > >> >>                  TRACE("\t Client %d is recovered", p_client-
>>> >> > >> >> >lgs_client_id);
>>> >> > >> >>                  if (rc == -1) {
>>> >> > >> >>                          TRACE("%s recover_one_client
>>> >> > >> >> Fail Deleting cllient (id %d)",
>>> >> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re
>>> >> > >> >>
>>> >> > >> >>          TRACE_ENTER();
>>> >> > >> >>
>>> >> > >> >> +        if (p_client == NULL) {
>>> >> > >> >> +                /**
>>> >> > >> >> +                 * Probably the client_list has been deleted by
>>> >> > >> >> client thread
>>> >> > >> >> +                 * in context calling saLogFinalize(). So, give 
>>> >> > >> >> up
>>> >> > >> >> recovering.
>>> >> > >> >> +                 */
>>> >> > >> >> +                rc = -1;
>>> >> > >> >> +                goto done;
>>> >> > >> >> +        }
>>> >> > >> >> +
>>> >> > >> >>          /* This function may be called both in the recovery 
>>> >> > >> >> thread and
>>> >> > >> >> in the
>>> >> > >> >>           * client thread. A possible scenario is that the 
>>> >> > >> >> recovery 2
>>> >> > >> >> thread
>>> >> > >> >>           * times out and calls this function in recovery mode 2 
>>> >> > >> >> while
>>> >> > >> >> there
>>> >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client(
>>> >> > >> >>          lga_client_hdl_rec_t *rm_node
>>> >> > >> >>          )
>>> >> > >> >>  {
>>> >> > >> >> -        TRACE_ENTER2();
>>> >> > >> >> -        uint32_t ncs_rc;
>>> >> > >> >> +        return lga_hdl_rec_del(list_head, rm_node);
>>> >> > >> >> +}
>>> >> > >> >>
>>> >> > >> >> +/**
>>> >> > >> >> + * Free the memory allocated for one client handle with mutext
>>> >> > >> protection.
>>> >> > >> >> + *
>>> >> > >> >> + * @param p_client_hdl
>>> >> > >> >> + * @return void
>>> >> > >> >> + */
>>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl)
>>> >> > >> >> +{
>>> >> > >> >> +        lga_client_hdl_rec_t *client_hdl;
>>> >> > >> >> +
>>> >> > >> >> +        /**
>>> >> > >> >> +         * To prevent race condition b/w client thread and 
>>> >> > >> >> recovery
>>> >> > >> >> thread
>>> >> > >> >> +         * accessing to common resource `client_list`. [#1705]
>>> >> > >> >> +         */
>>> >> > >> >>          osaf_mutex_lock_ordie(&lga_recov_mutex);
>>> >> > >> >> -        ncs_rc = lga_hdl_rec_del(list_head, rm_node);
>>> >> > >> >> +
>>> >> > >> >> +        client_hdl = *p_client_hdl;
>>> >> > >> >> +        if (client_hdl == NULL) goto done;
>>> >> > >> >> +
>>> >> > >> >> +        free(client_hdl);
>>> >> > >> >> +        client_hdl = NULL;
>>> >> > >> >> +
>>> >> > >> >> +done:
>>> >> > >> >>          osaf_mutex_unlock_ordie(&lga_recov_mutex);
>>> >> > >> >>
>>> >> > >> >> -        TRACE_LEAVE();
>>> >> > >> >> -        return ncs_rc;
>>> >> > >> >>  }
>>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h
>>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h
>>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h
>>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h
>>> >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client(
>>> >> > >> >>          lga_client_hdl_rec_t **list_head,
>>> >> > >> >>          lga_client_hdl_rec_t *rm_node
>>> >> > >> >>          );
>>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl);
>>> >> > >> >>
>>> >> > >> >>  #ifdef  __cplusplus
>>> >> > >> >>  }
>>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c
>>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c
>>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c
>>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c
>>> >> > >> >> @@ -19,6 +19,7 @@
>>> >> > >> >>  #include <syslog.h>
>>> >> > >> >>  #include "lga.h"
>>> >> > >> >>  #include "osaf_poll.h"
>>> >> > >> >> +#include "lga_state.h"
>>> >> > >> >>
>>> >> > >> >>  /* Variables used during startup/shutdown only */
>>> >> > >> >>  static pthread_mutex_t lga_lock =
>>> PTHREAD_MUTEX_INITIALIZER;
>>> >> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec
>>> >> > >> >>          /** clean up the channel records for this lga-client
>>> >> > >> >>           **/
>>> >> > >> >>                  lga_log_stream_hdl_rec_list_del(&client_hdl-
>>> >> > >> >> >stream_list);
>>> >> > >> >> -        /** remove the association with hdl-mngr
>>> >> > >> >> -         **/
>>> >> > >> >> -                free(client_hdl);
>>> >> > >> >> -                client_hdl = 0;
>>> >> > >> >> +                free_client_hdl(&client_hdl);
>>> >> > >> >>          }
>>> >> > >> >> +
>>> >> > >> >>          TRACE_LEAVE();
>>> >> > >> >>  }
>>> >> > >> >>
>>> >> > >> >> @@ -536,38 +535,35 @@ uint32_t
>>> lga_hdl_rec_del(lga_client_hdl_
>>> >> > >> >>          if (list_iter == rm_node) {
>>> >> > >> >>                  *list_head = rm_node->next;
>>> >> > >> >>
>>> >> > >> >> -        /** detach & release the IPC
>>> >> > >> >> -         **/
>>> >> > >> >> +                /* detach & release the IPC */
>>> >> > >> >>                  m_NCS_IPC_DETACH(&rm_node->mbx,
>>> >> > >> >> lga_clear_mbx, NULL);
>>> >> > >> >>                  m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
>>> >> > >> >>
>>> >> > >> >>                  ncshm_give_hdl(rm_node->local_hdl);
>>> >> > >> >>                  ncshm_destroy_hdl(NCS_SERVICE_ID_LGA,
>>> >> > >> >> rm_node->local_hdl);
>>> >> > >> >> -        /** Free the channel records off this hdl
>>> >> > >> >> -         **/
>>> >> > >> >> +
>>> >> > >> >> +                /* Free the channel records off this hdl */
>>> >> > >> >>                  lga_log_stream_hdl_rec_list_del(&rm_node-
>>> >> > >> >> >stream_list);
>>> >> > >> >> +                free_client_hdl(&rm_node);
>>> >> > >> >>
>>> >> > >> >> -        /** free the hdl rec
>>> >> > >> >> -         **/
>>> >> > >> >> -                free(rm_node);
>>> >> > >> >>                  rc = NCSCC_RC_SUCCESS;
>>> >> > >> >>                  goto out;
>>> >> > >> >> -        } else {                /* find the rec */
>>> >> > >> >> +        } else {
>>> >> > >> >> +                /* find the rec */
>>> >> > >> >>                  while (NULL != list_iter) {
>>> >> > >> >>                          if (list_iter->next == rm_node) {
>>> >> > >> >>                                  list_iter->next =
>>> >> > >> >> rm_node->next;
>>> >> > >> >>
>>> >> > >> >> -                /** detach & release the IPC */
>>> >> > >> >> +                                /* detach & release
>>> >> > >> >> the IPC */
>>> >> > >> >>
>>> >> > >> >>  m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx,
>>> >> > >> NULL);
>>> >> > >> >>
>>> >> > >> >>  m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
>>> >> > >> >>
>>> >> > >> >>
>>> >> > >> >>  ncshm_give_hdl(rm_node->local_hdl);
>>> >> > >> >>
>>> >> > >> >>  ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node-
>>> >> > >> >> >local_hdl);
>>> >> > >> >> -                /** Free the channel records off this lga_hdl  
>>> >> > >> >> */
>>> >> > >> >> +
>>> >> > >> >> +                                /* Free the channel
>>> >> > >> >> records off this lga_hdl  */
>>> >> > >> >>
>>> >> > >> >>  lga_log_stream_hdl_rec_list_del(&rm_node->stream_list);
>>> >> > >> >> -
>>> >> > >> >> -                /** free the hdl rec */
>>> >> > >> >> -                                free(rm_node);
>>> >> > >> >> +
>>> >> > >> >>  free_client_hdl(&rm_node);
>>> >> > >> >>
>>> >> > >> >>                                  rc =
>>> >> > >> >> NCSCC_RC_SUCCESS;
>>> >> > >> >>                                  goto out;
>>> >> > >>
>>> >> >
>>>



------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1 of 1] log: miss mutex protection for common resource in log agent [#1705]

Reply via email to