Hi Vu See my comments inline [Lennart]
Note. My comments does not give the complete solution. For that I need to use more time but may give you some ideas. /Lennart > -----Original Message----- > From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > Sent: den 5 april 2016 10:48 > To: Lennart Lund > Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net > Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource > in log agent [#1705] > > Hi Lennart, > > Thanks for your proposal. > > Still, there is race condition b/w mds thread and recover/client thread, I > think. > 1) Headless while recover thread 2 is running > 2) Headless while the client thread is accessing/modifying to client list. > [Lennart] In the mds thread the lga_no_server_state_set() is called if server down is detected. If recovery state is LGA_RECOVERY2 the thread is stopped and recovery state is set to state LGA_NO_SERVER. This means that we start all over again with recovery state LGA_RECOVERY1 when the server comes back again. Within the thread the cb lock is taken while the client list is deleted if needed. With a new mutex for lga_state this mutex should also be used in all other places where lga_state is handled not the cb_lock. This means that the name "recovery2" mutex is not so good instead there should be a lga_state mutex. The lga_state variable may also be moved outside the lga_cb structure. > I would like to propose other one, using only one mutex cb_lock @lga_cb > to protect common `lga_cb` resource, including `state`, `client` list, etc. in > three threads. > [Lennart] The cb_lock is used to protect variables in the cb structure but the lga_state mutex prevent recovery or API operations from being done at the same time. > 1) Client thread > - Lock mutex before getting the state > - Unlock right before the API operation is finished > [Lennart] Yes but the lga_state mutex > 2) Recovery2 thread > - Lock mutex after timeout > - Unlock before ending the thread > [Lennart] No this will cause the API to be locked as long as the recovery 2 thread is running and this is not wanted. The API shall work also while the recovery thread is running and most of the APIs shall return with SA_AIS_TRY_AGAIN during recovery 2 state. > 3) MDS thread > - Lock/unlock before/after changing state > In case of headless: > - Lock/unlock mutex before modifying `client_list` in > lga_no_server_state_set() > [Lennart] lga_state is not changed directly in the mds thread but indirectly when the lga_serv_recov1state_set() and lga_no_server_state_set() functions are called. I these functions the recovery thread and state change is handled and the mutex used shall be changed to the lga_state mutex. The client data should be protected by the cb_lock mutex in all places which is not the case. There is no protection in the mds trhead in the lga_lgs_msg_proc() where client data is used via the ga_find_hdl_rec_by_regid() function. This function should lock the lga_cb lock while handling the lga_cb client data. Also there are two functions in lga_state.c that are handling client data locking a lga_recov_mutex. This does not protect client data handled in the mds thread. > What is your opinion? Do you think there is impact to log client if they call > LOG API in multi threads? > [Lennart] The mutex handling should be cleaned up. - Recovery state should be protected by its own mutex so that APIs can work in respective state without locking. - Client data should be protected in all threads by the lga_cb lock. - The lga_recov_mutex can probably be removed > Regards, Vu. > > > >-----Original Message----- > >From: Lennart Lund [mailto:lennart.l...@ericsson.com] > >Sent: Monday, April 04, 2016 3:51 PM > >To: Vu Minh Nguyen > >Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net; > Lennart > >Lund > >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource > in > >log agent [#1705] > > > >Hi Vu, > > > >A simpler solution than described below, maybe: > > > >Still use a separate "recovery2" mutex but a "normal one". > > - In the API, lock the mutex just before checking if we are in recovery > > state > 2 > >(before "if (lga_state == LGA_RECOVERY2) {") > > - In the API, do not unlock until the API operation is finished > > - In the recovery thread lock the mutex just before changing the recovery > >state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2 > mutex) > >(protect the lga_state variable) > >- In the recovery thread the mutex shall be unlocked just after lga_state is > set > >to recovery state 2. > > > >This means that the recovery thread cannot set recovery state2 or execute > >until an eventual ongoing API operation is done. > >If the thread starts and set recovery state just before the API checks > recovery > >state the API will correctly see that we are in recovery state 2 and handle > thing > >accordingly. > > > >Regards > >Lennart > > > > > >> -----Original Message----- > >> From: Lennart Lund > >> Sent: den 1 april 2016 16:51 > >> To: Vu Minh Nguyen > >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net; > >> Lennart Lund > >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common > resource > >> in log agent [#1705] > >> > >> Hi Vu, > >> > >> You are right, good finding, this may happen since the recovery2 thread is > >> started and is waiting for a timeout. If this timeout happen at the "wrong > >> time" bad things may happen. > >> However the actual problem should be solved. > >> > >> This is how it could be done (maybe you can find a better way?): > >> When timeout the recovery thread shall set recovery state 2 as is but shall > >> not start any recovery if there is any ongoing client activity. > >> This could be solved by using a "recovery2" mutex (not the existing > >> lga_cb.cb_lock). > >> - The recovery thread lock this mutex after setting recovery state 2. > >> - During recovery state 1 all APIs check if the mutex is taken and if so > return > >> TRY AGAIN (this means that we are in recovery state 2 but the state > change > >> may have happened after the check in the API). Note: "trylock" cannot be > >> used for this purpose since a sequence trylock -> lock is not atomic. > Maybe a > >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used. > >> - If the mutex is not taken it shall be locked by the API function and be > >> unlocked just before the API function returns preventing the recovery > >> thread from starting recovery during an ongoing API operation. > >> The next time an API is called recovery state 2 is set and the API function > will > >> operate accordingly. > >> > >> Thanks > >> Lennart > >> > >> > -----Original Message----- > >> > From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > >> > Sent: den 1 april 2016 16:02 > >> > To: Lennart Lund > >> > Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net > >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common > >> resource > >> > in log agent [#1705] > >> > > >> > Hi Lennart, > >> > > >> > Is it possible for following case happens? > >> > > >> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG > >> > service up, recovery thread starts and > >> > comes to the point right after "while (pclient != NULL)" in > >> > recovery2_thread @lga_state.c > >> > Then, the context is switching to the client thread and it deletes the > >> > resource while recovery2_thread is running? > >> > > >> > Regards, Vu. > >> > > >> > Quoting Lennart Lund <lennart.l...@ericsson.com>: > >> > > >> > > Hi Vu, > >> > > > >> > > /* Recover clients one at a time */ > >> > > rc = recover_one_client(p_client); > >> > > + if (p_client == NULL) goto done; > >> > > + > >> > > [Lennart] Why is this needed? This check is done in "while > >> > > (p_client != NULL) { " > >> > > and p_client cannot become NULL until "p_client = p_client->next; " > >> > > is done as > >> > > the last line in the while loop. > >> > > > >> > > [Vu] It is my intention. I am afraid if the context is switched to > >> > > client thread > >> > > right after the recovery thread enters the while loop. > >> > > If it is the case, p_client will be NULL after done > >> > > recover_one_client(). > >> > > > >> > > [Lennart] I don't really understand your explanation. What is > >> > > happening in the client thread when we are in recovery state 2 that > >> > > can set p_client == NULL? The client thread should not be able to do > >> > > anything with the data involved when recovering client in the > >> > > recovery thread? > >> > > > >> > > Regards > >> > > Lennart > >> > > > >> > >> -----Original Message----- > >> > >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > >> > >> Sent: den 1 april 2016 14:14 > >> > >> To: Lennart Lund > >> > >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net > >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common > >> > resource > >> > >> in log agent [#1705] > >> > >> > >> > >> Hi Lennart, > >> > >> > >> > >> Please see my comment inline. > >> > >> > >> > >> Regards, Vu > >> > >> > >> > >> Quoting Lennart Lund <lennart.l...@ericsson.com>: > >> > >> > >> > >> > Hi Vu > >> > >> > > >> > >> > Ack with comments > >> > >> > > >> > >> > See my comments inline [Lennart]: > >> > >> > > >> > >> > Thanks > >> > >> > Lennart > >> > >> > > >> > >> >> -----Original Message----- > >> > >> >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > >> > >> >> Sent: den 23 mars 2016 08:16 > >> > >> >> To: mathi.naic...@oracle.com; Lennart Lund > >> > >> >> Cc: opensaf-devel@lists.sourceforge.net > >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common > >> > resource > >> > >> in > >> > >> >> log agent [#1705] > >> > >> >> > >> > >> >> osaf/libs/agents/saf/lga/lga_state.c | 41 > >> > >> >> +++++++++++++++++++++++++++++++---- > >> > >> >> osaf/libs/agents/saf/lga/lga_state.h | 1 + > >> > >> >> osaf/libs/agents/saf/lga/lga_util.c | 30 > >> > >> >> +++++++++++------------- > - > >> > >> >> 3 files changed, 50 insertions(+), 22 deletions(-) > >> > >> >> > >> > >> >> > >> > >> >> There was an race condition between client thread and recovery > >> > thread > >> > >> >> accessing to common resource `client_list`. > >> > >> >> > >> > >> >> Add mutex protection. > >> > >> >> > >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c > >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c > >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c > >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c > >> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void > *dumm > >> > >> >> /* We have been signaled to exit > >> > >> >> */ > >> > >> >> goto done; > >> > >> >> } > >> > >> >> + > >> > >> >> /* Recover clients one at a time */ > >> > >> >> rc = recover_one_client(p_client); > >> > >> >> + if (p_client == NULL) goto done; > >> > >> >> + > >> > >> > [Lennart] Why is this needed? This check is done in "while (p_client > >> > >> > != NULL) { " and p_client cannot become NULL until "p_client = > >> > >> > p_client->next; " is done as the last line in the while loop. > >> > >> > > >> > >> [Vu] It is my intention. I am afraid if the context is switched to > >> > >> client thread > >> > >> right after the recovery thread enters the while loop. > >> > >> If it is the case, p_client will be NULL after done > recover_one_client(). > >> > >> > >> > >> >> TRACE("\t Client %d is recovered", p_client- > >> > >> >> >lgs_client_id); > >> > >> >> if (rc == -1) { > >> > >> >> TRACE("%s recover_one_client > >> > >> >> Fail Deleting cllient (id %d)", > >> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re > >> > >> >> > >> > >> >> TRACE_ENTER(); > >> > >> >> > >> > >> >> + if (p_client == NULL) { > >> > >> >> + /** > >> > >> >> + * Probably the client_list has been deleted by > >> > >> >> client thread > >> > >> >> + * in context calling saLogFinalize(). So, give up > >> > >> >> recovering. > >> > >> >> + */ > >> > >> >> + rc = -1; > >> > >> >> + goto done; > >> > >> >> + } > >> > >> >> + > >> > >> >> /* This function may be called both in the recovery thread and > >> > >> >> in the > >> > >> >> * client thread. A possible scenario is that the recovery 2 > >> > >> >> thread > >> > >> >> * times out and calls this function in recovery mode 2 while > >> > >> >> there > >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client( > >> > >> >> lga_client_hdl_rec_t *rm_node > >> > >> >> ) > >> > >> >> { > >> > >> >> - TRACE_ENTER2(); > >> > >> >> - uint32_t ncs_rc; > >> > >> >> + return lga_hdl_rec_del(list_head, rm_node); > >> > >> >> +} > >> > >> >> > >> > >> >> +/** > >> > >> >> + * Free the memory allocated for one client handle with mutext > >> > >> protection. > >> > >> >> + * > >> > >> >> + * @param p_client_hdl > >> > >> >> + * @return void > >> > >> >> + */ > >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl) > >> > >> >> +{ > >> > >> >> + lga_client_hdl_rec_t *client_hdl; > >> > >> >> + > >> > >> >> + /** > >> > >> >> + * To prevent race condition b/w client thread and recovery > >> > >> >> thread > >> > >> >> + * accessing to common resource `client_list`. [#1705] > >> > >> >> + */ > >> > >> >> osaf_mutex_lock_ordie(&lga_recov_mutex); > >> > >> >> - ncs_rc = lga_hdl_rec_del(list_head, rm_node); > >> > >> >> + > >> > >> >> + client_hdl = *p_client_hdl; > >> > >> >> + if (client_hdl == NULL) goto done; > >> > >> >> + > >> > >> >> + free(client_hdl); > >> > >> >> + client_hdl = NULL; > >> > >> >> + > >> > >> >> +done: > >> > >> >> osaf_mutex_unlock_ordie(&lga_recov_mutex); > >> > >> >> > >> > >> >> - TRACE_LEAVE(); > >> > >> >> - return ncs_rc; > >> > >> >> } > >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h > >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h > >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h > >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h > >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client( > >> > >> >> lga_client_hdl_rec_t **list_head, > >> > >> >> lga_client_hdl_rec_t *rm_node > >> > >> >> ); > >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl); > >> > >> >> > >> > >> >> #ifdef __cplusplus > >> > >> >> } > >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c > >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c > >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c > >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c > >> > >> >> @@ -19,6 +19,7 @@ > >> > >> >> #include <syslog.h> > >> > >> >> #include "lga.h" > >> > >> >> #include "osaf_poll.h" > >> > >> >> +#include "lga_state.h" > >> > >> >> > >> > >> >> /* Variables used during startup/shutdown only */ > >> > >> >> static pthread_mutex_t lga_lock = > PTHREAD_MUTEX_INITIALIZER; > >> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec > >> > >> >> /** clean up the channel records for this lga-client > >> > >> >> **/ > >> > >> >> lga_log_stream_hdl_rec_list_del(&client_hdl- > >> > >> >> >stream_list); > >> > >> >> - /** remove the association with hdl-mngr > >> > >> >> - **/ > >> > >> >> - free(client_hdl); > >> > >> >> - client_hdl = 0; > >> > >> >> + free_client_hdl(&client_hdl); > >> > >> >> } > >> > >> >> + > >> > >> >> TRACE_LEAVE(); > >> > >> >> } > >> > >> >> > >> > >> >> @@ -536,38 +535,35 @@ uint32_t > lga_hdl_rec_del(lga_client_hdl_ > >> > >> >> if (list_iter == rm_node) { > >> > >> >> *list_head = rm_node->next; > >> > >> >> > >> > >> >> - /** detach & release the IPC > >> > >> >> - **/ > >> > >> >> + /* detach & release the IPC */ > >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, > >> > >> >> lga_clear_mbx, NULL); > >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); > >> > >> >> > >> > >> >> ncshm_give_hdl(rm_node->local_hdl); > >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, > >> > >> >> rm_node->local_hdl); > >> > >> >> - /** Free the channel records off this hdl > >> > >> >> - **/ > >> > >> >> + > >> > >> >> + /* Free the channel records off this hdl */ > >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node- > >> > >> >> >stream_list); > >> > >> >> + free_client_hdl(&rm_node); > >> > >> >> > >> > >> >> - /** free the hdl rec > >> > >> >> - **/ > >> > >> >> - free(rm_node); > >> > >> >> rc = NCSCC_RC_SUCCESS; > >> > >> >> goto out; > >> > >> >> - } else { /* find the rec */ > >> > >> >> + } else { > >> > >> >> + /* find the rec */ > >> > >> >> while (NULL != list_iter) { > >> > >> >> if (list_iter->next == rm_node) { > >> > >> >> list_iter->next = > >> > >> >> rm_node->next; > >> > >> >> > >> > >> >> - /** detach & release the IPC */ > >> > >> >> + /* detach & release > >> > >> >> the IPC */ > >> > >> >> > >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx, > >> > >> NULL); > >> > >> >> > >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); > >> > >> >> > >> > >> >> > >> > >> >> ncshm_give_hdl(rm_node->local_hdl); > >> > >> >> > >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node- > >> > >> >> >local_hdl); > >> > >> >> - /** Free the channel records off this lga_hdl */ > >> > >> >> + > >> > >> >> + /* Free the channel > >> > >> >> records off this lga_hdl */ > >> > >> >> > >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node->stream_list); > >> > >> >> - > >> > >> >> - /** free the hdl rec */ > >> > >> >> - free(rm_node); > >> > >> >> + > >> > >> >> free_client_hdl(&rm_node); > >> > >> >> > >> > >> >> rc = > >> > >> >> NCSCC_RC_SUCCESS; > >> > >> >> goto out; > >> > >> > >> > > ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel