Hi Lennart, Thanks for your proposal.
Still, there is race condition b/w mds thread and recover/client thread, I think. 1) Headless while recover thread 2 is running 2) Headless while the client thread is accessing/modifying to client list. I would like to propose other one, using only one mutex cb_lock @lga_cb to protect common `lga_cb` resource, including `state`, `client` list, etc. in three threads. 1) Client thread - Lock mutex before getting the state - Unlock right before the API operation is finished 2) Recovery2 thread - Lock mutex after timeout - Unlock before ending the thread 3) MDS thread - Lock/unlock before/after changing state In case of headless: - Lock/unlock mutex before modifying `client_list` in lga_no_server_state_set() What is your opinion? Do you think there is impact to log client if they call LOG API in multi threads? Regards, Vu. >-----Original Message----- >From: Lennart Lund [mailto:lennart.l...@ericsson.com] >Sent: Monday, April 04, 2016 3:51 PM >To: Vu Minh Nguyen >Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net; Lennart >Lund >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in >log agent [#1705] > >Hi Vu, > >A simpler solution than described below, maybe: > >Still use a separate "recovery2" mutex but a "normal one". > - In the API, lock the mutex just before checking if we are in recovery state > 2 >(before "if (lga_state == LGA_RECOVERY2) {") > - In the API, do not unlock until the API operation is finished > - In the recovery thread lock the mutex just before changing the recovery >state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2 mutex) >(protect the lga_state variable) >- In the recovery thread the mutex shall be unlocked just after lga_state is >set >to recovery state 2. > >This means that the recovery thread cannot set recovery state2 or execute >until an eventual ongoing API operation is done. >If the thread starts and set recovery state just before the API checks recovery >state the API will correctly see that we are in recovery state 2 and handle >thing >accordingly. > >Regards >Lennart > > >> -----Original Message----- >> From: Lennart Lund >> Sent: den 1 april 2016 16:51 >> To: Vu Minh Nguyen >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net; >> Lennart Lund >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource >> in log agent [#1705] >> >> Hi Vu, >> >> You are right, good finding, this may happen since the recovery2 thread is >> started and is waiting for a timeout. If this timeout happen at the "wrong >> time" bad things may happen. >> However the actual problem should be solved. >> >> This is how it could be done (maybe you can find a better way?): >> When timeout the recovery thread shall set recovery state 2 as is but shall >> not start any recovery if there is any ongoing client activity. >> This could be solved by using a "recovery2" mutex (not the existing >> lga_cb.cb_lock). >> - The recovery thread lock this mutex after setting recovery state 2. >> - During recovery state 1 all APIs check if the mutex is taken and if so >> return >> TRY AGAIN (this means that we are in recovery state 2 but the state change >> may have happened after the check in the API). Note: "trylock" cannot be >> used for this purpose since a sequence trylock -> lock is not atomic. Maybe a >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used. >> - If the mutex is not taken it shall be locked by the API function and be >> unlocked just before the API function returns preventing the recovery >> thread from starting recovery during an ongoing API operation. >> The next time an API is called recovery state 2 is set and the API function >> will >> operate accordingly. >> >> Thanks >> Lennart >> >> > -----Original Message----- >> > From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] >> > Sent: den 1 april 2016 16:02 >> > To: Lennart Lund >> > Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common >> resource >> > in log agent [#1705] >> > >> > Hi Lennart, >> > >> > Is it possible for following case happens? >> > >> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG >> > service up, recovery thread starts and >> > comes to the point right after "while (pclient != NULL)" in >> > recovery2_thread @lga_state.c >> > Then, the context is switching to the client thread and it deletes the >> > resource while recovery2_thread is running? >> > >> > Regards, Vu. >> > >> > Quoting Lennart Lund <lennart.l...@ericsson.com>: >> > >> > > Hi Vu, >> > > >> > > /* Recover clients one at a time */ >> > > rc = recover_one_client(p_client); >> > > + if (p_client == NULL) goto done; >> > > + >> > > [Lennart] Why is this needed? This check is done in "while >> > > (p_client != NULL) { " >> > > and p_client cannot become NULL until "p_client = p_client->next; " >> > > is done as >> > > the last line in the while loop. >> > > >> > > [Vu] It is my intention. I am afraid if the context is switched to >> > > client thread >> > > right after the recovery thread enters the while loop. >> > > If it is the case, p_client will be NULL after done >> > > recover_one_client(). >> > > >> > > [Lennart] I don't really understand your explanation. What is >> > > happening in the client thread when we are in recovery state 2 that >> > > can set p_client == NULL? The client thread should not be able to do >> > > anything with the data involved when recovering client in the >> > > recovery thread? >> > > >> > > Regards >> > > Lennart >> > > >> > >> -----Original Message----- >> > >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] >> > >> Sent: den 1 april 2016 14:14 >> > >> To: Lennart Lund >> > >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common >> > resource >> > >> in log agent [#1705] >> > >> >> > >> Hi Lennart, >> > >> >> > >> Please see my comment inline. >> > >> >> > >> Regards, Vu >> > >> >> > >> Quoting Lennart Lund <lennart.l...@ericsson.com>: >> > >> >> > >> > Hi Vu >> > >> > >> > >> > Ack with comments >> > >> > >> > >> > See my comments inline [Lennart]: >> > >> > >> > >> > Thanks >> > >> > Lennart >> > >> > >> > >> >> -----Original Message----- >> > >> >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] >> > >> >> Sent: den 23 mars 2016 08:16 >> > >> >> To: mathi.naic...@oracle.com; Lennart Lund >> > >> >> Cc: opensaf-devel@lists.sourceforge.net >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common >> > resource >> > >> in >> > >> >> log agent [#1705] >> > >> >> >> > >> >> osaf/libs/agents/saf/lga/lga_state.c | 41 >> > >> >> +++++++++++++++++++++++++++++++---- >> > >> >> osaf/libs/agents/saf/lga/lga_state.h | 1 + >> > >> >> osaf/libs/agents/saf/lga/lga_util.c | 30 +++++++++++-------------- >> > >> >> 3 files changed, 50 insertions(+), 22 deletions(-) >> > >> >> >> > >> >> >> > >> >> There was an race condition between client thread and recovery >> > thread >> > >> >> accessing to common resource `client_list`. >> > >> >> >> > >> >> Add mutex protection. >> > >> >> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c >> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void *dumm >> > >> >> /* We have been signaled to exit >> > >> >> */ >> > >> >> goto done; >> > >> >> } >> > >> >> + >> > >> >> /* Recover clients one at a time */ >> > >> >> rc = recover_one_client(p_client); >> > >> >> + if (p_client == NULL) goto done; >> > >> >> + >> > >> > [Lennart] Why is this needed? This check is done in "while (p_client >> > >> > != NULL) { " and p_client cannot become NULL until "p_client = >> > >> > p_client->next; " is done as the last line in the while loop. >> > >> > >> > >> [Vu] It is my intention. I am afraid if the context is switched to >> > >> client thread >> > >> right after the recovery thread enters the while loop. >> > >> If it is the case, p_client will be NULL after done >> > >> recover_one_client(). >> > >> >> > >> >> TRACE("\t Client %d is recovered", p_client- >> > >> >> >lgs_client_id); >> > >> >> if (rc == -1) { >> > >> >> TRACE("%s recover_one_client >> > >> >> Fail Deleting cllient (id %d)", >> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re >> > >> >> >> > >> >> TRACE_ENTER(); >> > >> >> >> > >> >> + if (p_client == NULL) { >> > >> >> + /** >> > >> >> + * Probably the client_list has been deleted by >> > >> >> client thread >> > >> >> + * in context calling saLogFinalize(). So, give up >> > >> >> recovering. >> > >> >> + */ >> > >> >> + rc = -1; >> > >> >> + goto done; >> > >> >> + } >> > >> >> + >> > >> >> /* This function may be called both in the recovery thread and >> > >> >> in the >> > >> >> * client thread. A possible scenario is that the recovery 2 >> > >> >> thread >> > >> >> * times out and calls this function in recovery mode 2 while >> > >> >> there >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client( >> > >> >> lga_client_hdl_rec_t *rm_node >> > >> >> ) >> > >> >> { >> > >> >> - TRACE_ENTER2(); >> > >> >> - uint32_t ncs_rc; >> > >> >> + return lga_hdl_rec_del(list_head, rm_node); >> > >> >> +} >> > >> >> >> > >> >> +/** >> > >> >> + * Free the memory allocated for one client handle with mutext >> > >> protection. >> > >> >> + * >> > >> >> + * @param p_client_hdl >> > >> >> + * @return void >> > >> >> + */ >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl) >> > >> >> +{ >> > >> >> + lga_client_hdl_rec_t *client_hdl; >> > >> >> + >> > >> >> + /** >> > >> >> + * To prevent race condition b/w client thread and recovery >> > >> >> thread >> > >> >> + * accessing to common resource `client_list`. [#1705] >> > >> >> + */ >> > >> >> osaf_mutex_lock_ordie(&lga_recov_mutex); >> > >> >> - ncs_rc = lga_hdl_rec_del(list_head, rm_node); >> > >> >> + >> > >> >> + client_hdl = *p_client_hdl; >> > >> >> + if (client_hdl == NULL) goto done; >> > >> >> + >> > >> >> + free(client_hdl); >> > >> >> + client_hdl = NULL; >> > >> >> + >> > >> >> +done: >> > >> >> osaf_mutex_unlock_ordie(&lga_recov_mutex); >> > >> >> >> > >> >> - TRACE_LEAVE(); >> > >> >> - return ncs_rc; >> > >> >> } >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client( >> > >> >> lga_client_hdl_rec_t **list_head, >> > >> >> lga_client_hdl_rec_t *rm_node >> > >> >> ); >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl); >> > >> >> >> > >> >> #ifdef __cplusplus >> > >> >> } >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c >> > >> >> @@ -19,6 +19,7 @@ >> > >> >> #include <syslog.h> >> > >> >> #include "lga.h" >> > >> >> #include "osaf_poll.h" >> > >> >> +#include "lga_state.h" >> > >> >> >> > >> >> /* Variables used during startup/shutdown only */ >> > >> >> static pthread_mutex_t lga_lock = PTHREAD_MUTEX_INITIALIZER; >> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec >> > >> >> /** clean up the channel records for this lga-client >> > >> >> **/ >> > >> >> lga_log_stream_hdl_rec_list_del(&client_hdl- >> > >> >> >stream_list); >> > >> >> - /** remove the association with hdl-mngr >> > >> >> - **/ >> > >> >> - free(client_hdl); >> > >> >> - client_hdl = 0; >> > >> >> + free_client_hdl(&client_hdl); >> > >> >> } >> > >> >> + >> > >> >> TRACE_LEAVE(); >> > >> >> } >> > >> >> >> > >> >> @@ -536,38 +535,35 @@ uint32_t lga_hdl_rec_del(lga_client_hdl_ >> > >> >> if (list_iter == rm_node) { >> > >> >> *list_head = rm_node->next; >> > >> >> >> > >> >> - /** detach & release the IPC >> > >> >> - **/ >> > >> >> + /* detach & release the IPC */ >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, >> > >> >> lga_clear_mbx, NULL); >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); >> > >> >> >> > >> >> ncshm_give_hdl(rm_node->local_hdl); >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, >> > >> >> rm_node->local_hdl); >> > >> >> - /** Free the channel records off this hdl >> > >> >> - **/ >> > >> >> + >> > >> >> + /* Free the channel records off this hdl */ >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node- >> > >> >> >stream_list); >> > >> >> + free_client_hdl(&rm_node); >> > >> >> >> > >> >> - /** free the hdl rec >> > >> >> - **/ >> > >> >> - free(rm_node); >> > >> >> rc = NCSCC_RC_SUCCESS; >> > >> >> goto out; >> > >> >> - } else { /* find the rec */ >> > >> >> + } else { >> > >> >> + /* find the rec */ >> > >> >> while (NULL != list_iter) { >> > >> >> if (list_iter->next == rm_node) { >> > >> >> list_iter->next = >> > >> >> rm_node->next; >> > >> >> >> > >> >> - /** detach & release the IPC */ >> > >> >> + /* detach & release >> > >> >> the IPC */ >> > >> >> >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx, >> > >> NULL); >> > >> >> >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); >> > >> >> >> > >> >> >> > >> >> ncshm_give_hdl(rm_node->local_hdl); >> > >> >> >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node- >> > >> >> >local_hdl); >> > >> >> - /** Free the channel records off this lga_hdl */ >> > >> >> + >> > >> >> + /* Free the channel >> > >> >> records off this lga_hdl */ >> > >> >> >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node->stream_list); >> > >> >> - >> > >> >> - /** free the hdl rec */ >> > >> >> - free(rm_node); >> > >> >> + >> > >> >> free_client_hdl(&rm_node); >> > >> >> >> > >> >> rc = >> > >> >> NCSCC_RC_SUCCESS; >> > >> >> goto out; >> > >> >> > ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel