Hi Vu

See my comments inline [Lennart]

Note. My comments does not give the complete solution. For that I need to use 
more time but may give you some ideas.

/Lennart

> -----Original Message-----
> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
> Sent: den 5 april 2016 10:48
> To: Lennart Lund
> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource
> in log agent [#1705]
> 
> Hi Lennart,
> 
> Thanks for your proposal.
> 
> Still, there is race condition b/w mds thread and recover/client thread, I
> think.
> 1) Headless while recover thread 2 is running
> 2) Headless while the client thread is accessing/modifying to client list.
> 
[Lennart] In the mds thread the lga_no_server_state_set() is called if server 
down is detected. If recovery state is LGA_RECOVERY2 the thread is stopped
and recovery state is set to state LGA_NO_SERVER. This means that we start all 
over again with recovery state LGA_RECOVERY1 when the server comes back again.
Within the thread the cb lock is taken while the client list is deleted if 
needed.
With a new mutex for lga_state this mutex should also be used in all other 
places where lga_state is handled not the cb_lock. This means that the name 
"recovery2" mutex is not so good instead there should be a lga_state mutex. The 
lga_state variable may also be moved outside the lga_cb structure.

> I would like to propose other one, using only one mutex cb_lock @lga_cb
> to protect common `lga_cb` resource, including `state`, `client` list, etc. in
> three threads.
> 
[Lennart] The cb_lock is used to protect variables in the cb structure but the 
lga_state mutex prevent recovery or API operations from being done at the same 
time.

> 1) Client thread
> - Lock mutex before getting the state
> - Unlock right before the API operation is finished
> 
[Lennart] Yes but the lga_state mutex

> 2) Recovery2 thread
> - Lock mutex after timeout
> - Unlock before ending the thread
> 
[Lennart] No this will cause the API to be locked as long as the recovery 2 
thread is running and this is not wanted. The API shall work also while the 
recovery thread is running and most of the APIs shall return with 
SA_AIS_TRY_AGAIN during recovery 2 state.

> 3) MDS thread
> - Lock/unlock before/after changing state
> In case of headless:
> - Lock/unlock mutex before modifying `client_list` in
> lga_no_server_state_set()
> 
[Lennart] lga_state is not changed directly in the mds thread but indirectly 
when the lga_serv_recov1state_set() and lga_no_server_state_set() functions are 
called. I these functions the recovery thread and state change is handled and 
the mutex used shall be changed to the lga_state mutex.

The client data should be protected by the cb_lock mutex in all places which is 
not the case. There is no protection in the mds trhead in the 
lga_lgs_msg_proc() where client data is used via the ga_find_hdl_rec_by_regid() 
function. This function should lock the lga_cb lock while handling the lga_cb 
client data.
Also there are two functions in lga_state.c that are handling client data 
locking a lga_recov_mutex. This does not protect client data handled in the mds 
thread.

> What is your opinion? Do you think there is impact to log client if they call
> LOG API in multi threads?
> 
[Lennart] The mutex handling should be cleaned up.
 - Recovery state should be protected by its own mutex so that APIs can work in 
respective state without locking.
 - Client data should be protected in all threads by the lga_cb lock.
 - The lga_recov_mutex can probably be removed

> Regards, Vu.
> 
> 
> >-----Original Message-----
> >From: Lennart Lund [mailto:lennart.l...@ericsson.com]
> >Sent: Monday, April 04, 2016 3:51 PM
> >To: Vu Minh Nguyen
> >Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net;
> Lennart
> >Lund
> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource
> in
> >log agent [#1705]
> >
> >Hi Vu,
> >
> >A simpler solution than described below, maybe:
> >
> >Still use a separate "recovery2" mutex but a "normal one".
> > - In the API, lock the mutex just before checking if we are in recovery 
> > state
> 2
> >(before "if (lga_state == LGA_RECOVERY2) {")
> > - In the API, do not unlock until the API operation is finished
> > - In the recovery thread lock the mutex just before  changing the recovery
> >state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2
> mutex)
> >(protect the lga_state variable)
> >- In the recovery thread the mutex shall be unlocked just after lga_state is
> set
> >to recovery state 2.
> >
> >This means that the recovery thread cannot set recovery state2 or execute
> >until an eventual ongoing API operation is done.
> >If the thread starts and set recovery state just before the API checks
> recovery
> >state the API will correctly see that we are in recovery state 2 and handle
> thing
> >accordingly.
> >
> >Regards
> >Lennart
> >
> >
> >> -----Original Message-----
> >> From: Lennart Lund
> >> Sent: den 1 april 2016 16:51
> >> To: Vu Minh Nguyen
> >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net;
> >> Lennart Lund
> >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common
> resource
> >> in log agent [#1705]
> >>
> >> Hi Vu,
> >>
> >> You are right, good finding, this may happen since the recovery2 thread is
> >> started and is waiting for a timeout. If this timeout happen at the "wrong
> >> time" bad things may happen.
> >> However the actual problem should be solved.
> >>
> >> This is how it could be done (maybe you can find a better way?):
> >> When timeout the recovery thread shall set recovery state 2 as is but shall
> >> not start any recovery if there is any ongoing client activity.
> >> This could be solved by using a "recovery2" mutex (not the existing
> >> lga_cb.cb_lock).
> >>  - The recovery thread lock this mutex after setting recovery state 2.
> >>  - During recovery state 1 all APIs check if the mutex is taken and if so
> return
> >> TRY AGAIN (this means that we are in recovery state 2 but the state
> change
> >> may have happened after the check in the API). Note: "trylock" cannot be
> >> used for this purpose since a sequence trylock -> lock is not atomic.
> Maybe a
> >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used.
> >>  - If the mutex is not taken it shall be locked by the API function and be
> >> unlocked just before the API function returns preventing the recovery
> >> thread from starting recovery during an ongoing API operation.
> >> The next time an API is called recovery state 2 is set and the API function
> will
> >> operate accordingly.
> >>
> >> Thanks
> >> Lennart
> >>
> >> > -----Original Message-----
> >> > From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
> >> > Sent: den 1 april 2016 16:02
> >> > To: Lennart Lund
> >> > Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
> >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
> >> resource
> >> > in log agent [#1705]
> >> >
> >> > Hi Lennart,
> >> >
> >> > Is it possible for following case happens?
> >> >
> >> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG
> >> > service up, recovery thread starts and
> >> > comes to the point right after "while (pclient != NULL)" in
> >> > recovery2_thread @lga_state.c
> >> > Then, the context is switching to the client thread and it deletes the
> >> > resource while recovery2_thread is running?
> >> >
> >> > Regards, Vu.
> >> >
> >> > Quoting Lennart Lund <lennart.l...@ericsson.com>:
> >> >
> >> > > Hi Vu,
> >> > >
> >> > >                /* Recover clients one at a time */
> >> > >                rc = recover_one_client(p_client);
> >> > >  +             if (p_client == NULL) goto done;
> >> > >  +
> >> > >  [Lennart] Why is this needed? This check is done in "while
> >> > > (p_client != NULL) { "
> >> > > and p_client cannot become NULL until "p_client = p_client->next; "
> >> > > is done as
> >> > > the last line in the while loop.
> >> > >
> >> > >  [Vu] It is my intention. I am afraid if the context is switched to
> >> > >  client thread
> >> > >  right after the recovery thread enters the while loop.
> >> > >  If it is the case, p_client will be NULL after done 
> >> > > recover_one_client().
> >> > >
> >> > > [Lennart] I don't really understand your explanation. What is
> >> > > happening in the client thread when we are in recovery state 2 that
> >> > > can set p_client == NULL? The client thread should not be able to do
> >> > > anything with the data involved when recovering client in the
> >> > > recovery thread?
> >> > >
> >> > > Regards
> >> > > Lennart
> >> > >
> >> > >> -----Original Message-----
> >> > >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
> >> > >> Sent: den 1 april 2016 14:14
> >> > >> To: Lennart Lund
> >> > >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
> >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
> >> > resource
> >> > >> in log agent [#1705]
> >> > >>
> >> > >> Hi Lennart,
> >> > >>
> >> > >> Please see my comment inline.
> >> > >>
> >> > >> Regards, Vu
> >> > >>
> >> > >> Quoting Lennart Lund <lennart.l...@ericsson.com>:
> >> > >>
> >> > >> > Hi Vu
> >> > >> >
> >> > >> > Ack with comments
> >> > >> >
> >> > >> > See my comments inline [Lennart]:
> >> > >> >
> >> > >> > Thanks
> >> > >> > Lennart
> >> > >> >
> >> > >> >> -----Original Message-----
> >> > >> >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
> >> > >> >> Sent: den 23 mars 2016 08:16
> >> > >> >> To: mathi.naic...@oracle.com; Lennart Lund
> >> > >> >> Cc: opensaf-devel@lists.sourceforge.net
> >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common
> >> > resource
> >> > >> in
> >> > >> >> log agent [#1705]
> >> > >> >>
> >> > >> >>  osaf/libs/agents/saf/lga/lga_state.c |  41
> >> > >> >> +++++++++++++++++++++++++++++++----
> >> > >> >>  osaf/libs/agents/saf/lga/lga_state.h |   1 +
> >> > >> >>  osaf/libs/agents/saf/lga/lga_util.c  |  30 
> >> > >> >> +++++++++++-------------
> -
> >> > >> >>  3 files changed, 50 insertions(+), 22 deletions(-)
> >> > >> >>
> >> > >> >>
> >> > >> >> There was an race condition between client thread and recovery
> >> > thread
> >> > >> >> accessing to common resource `client_list`.
> >> > >> >>
> >> > >> >> Add mutex protection.
> >> > >> >>
> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c
> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c
> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c
> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c
> >> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void
> *dumm
> >> > >> >>                    /* We have been signaled to exit
> >> > >> >> */
> >> > >> >>                    goto done;
> >> > >> >>            }
> >> > >> >> +
> >> > >> >>            /* Recover clients one at a time */
> >> > >> >>            rc = recover_one_client(p_client);
> >> > >> >> +          if (p_client == NULL) goto done;
> >> > >> >> +
> >> > >> > [Lennart] Why is this needed? This check is done in "while (p_client
> >> > >> > != NULL) { " and p_client cannot become NULL until "p_client =
> >> > >> > p_client->next; " is done as the last line in the while loop.
> >> > >> >
> >> > >> [Vu] It is my intention. I am afraid if the context is switched to
> >> > >> client thread
> >> > >> right after the recovery thread enters the while loop.
> >> > >> If it is the case, p_client will be NULL after done
> recover_one_client().
> >> > >>
> >> > >> >>            TRACE("\t Client %d is recovered", p_client-
> >> > >> >> >lgs_client_id);
> >> > >> >>            if (rc == -1) {
> >> > >> >>                    TRACE("%s recover_one_client
> >> > >> >> Fail Deleting cllient (id %d)",
> >> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re
> >> > >> >>
> >> > >> >>    TRACE_ENTER();
> >> > >> >>
> >> > >> >> +  if (p_client == NULL) {
> >> > >> >> +          /**
> >> > >> >> +           * Probably the client_list has been deleted by
> >> > >> >> client thread
> >> > >> >> +           * in context calling saLogFinalize(). So, give up
> >> > >> >> recovering.
> >> > >> >> +           */
> >> > >> >> +          rc = -1;
> >> > >> >> +          goto done;
> >> > >> >> +  }
> >> > >> >> +
> >> > >> >>    /* This function may be called both in the recovery thread and
> >> > >> >> in the
> >> > >> >>     * client thread. A possible scenario is that the recovery 2
> >> > >> >> thread
> >> > >> >>     * times out and calls this function in recovery mode 2 while
> >> > >> >> there
> >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client(
> >> > >> >>    lga_client_hdl_rec_t *rm_node
> >> > >> >>    )
> >> > >> >>  {
> >> > >> >> -  TRACE_ENTER2();
> >> > >> >> -  uint32_t ncs_rc;
> >> > >> >> +  return lga_hdl_rec_del(list_head, rm_node);
> >> > >> >> +}
> >> > >> >>
> >> > >> >> +/**
> >> > >> >> + * Free the memory allocated for one client handle with mutext
> >> > >> protection.
> >> > >> >> + *
> >> > >> >> + * @param p_client_hdl
> >> > >> >> + * @return void
> >> > >> >> + */
> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl)
> >> > >> >> +{
> >> > >> >> +  lga_client_hdl_rec_t *client_hdl;
> >> > >> >> +
> >> > >> >> +  /**
> >> > >> >> +   * To prevent race condition b/w client thread and recovery
> >> > >> >> thread
> >> > >> >> +   * accessing to common resource `client_list`. [#1705]
> >> > >> >> +   */
> >> > >> >>    osaf_mutex_lock_ordie(&lga_recov_mutex);
> >> > >> >> -  ncs_rc = lga_hdl_rec_del(list_head, rm_node);
> >> > >> >> +
> >> > >> >> +  client_hdl = *p_client_hdl;
> >> > >> >> +  if (client_hdl == NULL) goto done;
> >> > >> >> +
> >> > >> >> +  free(client_hdl);
> >> > >> >> +  client_hdl = NULL;
> >> > >> >> +
> >> > >> >> +done:
> >> > >> >>    osaf_mutex_unlock_ordie(&lga_recov_mutex);
> >> > >> >>
> >> > >> >> -  TRACE_LEAVE();
> >> > >> >> -  return ncs_rc;
> >> > >> >>  }
> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h
> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h
> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h
> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h
> >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client(
> >> > >> >>    lga_client_hdl_rec_t **list_head,
> >> > >> >>    lga_client_hdl_rec_t *rm_node
> >> > >> >>    );
> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl);
> >> > >> >>
> >> > >> >>  #ifdef    __cplusplus
> >> > >> >>  }
> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c
> >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c
> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c
> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c
> >> > >> >> @@ -19,6 +19,7 @@
> >> > >> >>  #include <syslog.h>
> >> > >> >>  #include "lga.h"
> >> > >> >>  #include "osaf_poll.h"
> >> > >> >> +#include "lga_state.h"
> >> > >> >>
> >> > >> >>  /* Variables used during startup/shutdown only */
> >> > >> >>  static pthread_mutex_t lga_lock =
> PTHREAD_MUTEX_INITIALIZER;
> >> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec
> >> > >> >>    /** clean up the channel records for this lga-client
> >> > >> >>           **/
> >> > >> >>            lga_log_stream_hdl_rec_list_del(&client_hdl-
> >> > >> >> >stream_list);
> >> > >> >> -  /** remove the association with hdl-mngr
> >> > >> >> -         **/
> >> > >> >> -          free(client_hdl);
> >> > >> >> -          client_hdl = 0;
> >> > >> >> +          free_client_hdl(&client_hdl);
> >> > >> >>    }
> >> > >> >> +
> >> > >> >>    TRACE_LEAVE();
> >> > >> >>  }
> >> > >> >>
> >> > >> >> @@ -536,38 +535,35 @@ uint32_t
> lga_hdl_rec_del(lga_client_hdl_
> >> > >> >>    if (list_iter == rm_node) {
> >> > >> >>            *list_head = rm_node->next;
> >> > >> >>
> >> > >> >> -  /** detach & release the IPC
> >> > >> >> -         **/
> >> > >> >> +          /* detach & release the IPC */
> >> > >> >>            m_NCS_IPC_DETACH(&rm_node->mbx,
> >> > >> >> lga_clear_mbx, NULL);
> >> > >> >>            m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
> >> > >> >>
> >> > >> >>            ncshm_give_hdl(rm_node->local_hdl);
> >> > >> >>            ncshm_destroy_hdl(NCS_SERVICE_ID_LGA,
> >> > >> >> rm_node->local_hdl);
> >> > >> >> -  /** Free the channel records off this hdl
> >> > >> >> -         **/
> >> > >> >> +
> >> > >> >> +          /* Free the channel records off this hdl */
> >> > >> >>            lga_log_stream_hdl_rec_list_del(&rm_node-
> >> > >> >> >stream_list);
> >> > >> >> +          free_client_hdl(&rm_node);
> >> > >> >>
> >> > >> >> -  /** free the hdl rec
> >> > >> >> -         **/
> >> > >> >> -          free(rm_node);
> >> > >> >>            rc = NCSCC_RC_SUCCESS;
> >> > >> >>            goto out;
> >> > >> >> -  } else {                /* find the rec */
> >> > >> >> +  } else {
> >> > >> >> +          /* find the rec */
> >> > >> >>            while (NULL != list_iter) {
> >> > >> >>                    if (list_iter->next == rm_node) {
> >> > >> >>                            list_iter->next =
> >> > >> >> rm_node->next;
> >> > >> >>
> >> > >> >> -          /** detach & release the IPC */
> >> > >> >> +                          /* detach & release
> >> > >> >> the IPC */
> >> > >> >>
> >> > >> >>    m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx,
> >> > >> NULL);
> >> > >> >>
> >> > >> >>    m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
> >> > >> >>
> >> > >> >>
> >> > >> >>    ncshm_give_hdl(rm_node->local_hdl);
> >> > >> >>
> >> > >> >>    ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node-
> >> > >> >> >local_hdl);
> >> > >> >> -          /** Free the channel records off this lga_hdl  */
> >> > >> >> +
> >> > >> >> +                          /* Free the channel
> >> > >> >> records off this lga_hdl  */
> >> > >> >>
> >> > >> >>    lga_log_stream_hdl_rec_list_del(&rm_node->stream_list);
> >> > >> >> -
> >> > >> >> -          /** free the hdl rec */
> >> > >> >> -                          free(rm_node);
> >> > >> >> +
> >> > >> >>    free_client_hdl(&rm_node);
> >> > >> >>
> >> > >> >>                            rc =
> >> > >> >> NCSCC_RC_SUCCESS;
> >> > >> >>                            goto out;
> >> > >>
> >> >
> 

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to