Hi all, Have you had time to look at this? Thanks.
Regards, Vu. >-----Original Message----- >From: Vu Minh Nguyen [mailto:[email protected]] >Sent: Wednesday, April 06, 2016 12:44 PM >To: 'Lennart Lund' >Cc: '[email protected]'; '[email protected]' >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in >log agent [#1705] > >Hi Lennart, > >Thanks for your comments. > >I think it is good idea to pick the lgs_state & lga_state out of lga_cb, >and use the mutex cb_lock to protect changes in lga control block. > >I reflect these things in attached file. > >And to make the code a bit clean, I have defined set/get functions for >lgs_state >& lga_state. >Also, I declare the mutex cb_lock as recursive one to avoid possible >recursively >lock in one thread. > >Could you have a look and give your comments? > >Regards, Vu. > >>-----Original Message----- >>From: Lennart Lund [mailto:[email protected]] >>Sent: Tuesday, April 05, 2016 6:59 PM >>To: Vu Minh Nguyen >>Cc: [email protected]; [email protected]; Lennart >>Lund >>Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in >>log agent [#1705] >> >>Hi Vu >> >>See my comments inline [Lennart] >> >>Note. My comments does not give the complete solution. For that I need to >>use more time but may give you some ideas. >> >>/Lennart >> >>> -----Original Message----- >>> From: Vu Minh Nguyen [mailto:[email protected]] >>> Sent: den 5 april 2016 10:48 >>> To: Lennart Lund >>> Cc: [email protected]; [email protected] >>> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource >>> in log agent [#1705] >>> >>> Hi Lennart, >>> >>> Thanks for your proposal. >>> >>> Still, there is race condition b/w mds thread and recover/client thread, I >>> think. >>> 1) Headless while recover thread 2 is running >>> 2) Headless while the client thread is accessing/modifying to client list. >>> >>[Lennart] In the mds thread the lga_no_server_state_set() is called if server >>down is detected. If recovery state is LGA_RECOVERY2 the thread is stopped >>and recovery state is set to state LGA_NO_SERVER. This means that we start >all >>over again with recovery state LGA_RECOVERY1 when the server comes back >>again. >>Within the thread the cb lock is taken while the client list is deleted if >needed. >>With a new mutex for lga_state this mutex should also be used in all other >>places where lga_state is handled not the cb_lock. This means that the name >>"recovery2" mutex is not so good instead there should be a lga_state mutex. >>The lga_state variable may also be moved outside the lga_cb structure. >> >>> I would like to propose other one, using only one mutex cb_lock @lga_cb >>> to protect common `lga_cb` resource, including `state`, `client` list, etc. >>> in >>> three threads. >>> >>[Lennart] The cb_lock is used to protect variables in the cb structure but the >>lga_state mutex prevent recovery or API operations from being done at the >>same time. >> >>> 1) Client thread >>> - Lock mutex before getting the state >>> - Unlock right before the API operation is finished >>> >>[Lennart] Yes but the lga_state mutex >> >>> 2) Recovery2 thread >>> - Lock mutex after timeout >>> - Unlock before ending the thread >>> >>[Lennart] No this will cause the API to be locked as long as the recovery 2 >>thread is running and this is not wanted. The API shall work also while the >>recovery thread is running and most of the APIs shall return with >>SA_AIS_TRY_AGAIN during recovery 2 state. >> >>> 3) MDS thread >>> - Lock/unlock before/after changing state >>> In case of headless: >>> - Lock/unlock mutex before modifying `client_list` in >>> lga_no_server_state_set() >>> >>[Lennart] lga_state is not changed directly in the mds thread but indirectly >>when the lga_serv_recov1state_set() and lga_no_server_state_set() functions >>are called. I these functions the recovery thread and state change is handled >>and the mutex used shall be changed to the lga_state mutex. >> >>The client data should be protected by the cb_lock mutex in all places which >is >>not the case. There is no protection in the mds trhead in the >>lga_lgs_msg_proc() where client data is used via the >>ga_find_hdl_rec_by_regid() function. This function should lock the lga_cb lock >>while handling the lga_cb client data. >>Also there are two functions in lga_state.c that are handling client data >locking >>a lga_recov_mutex. This does not protect client data handled in the mds >>thread. >> >>> What is your opinion? Do you think there is impact to log client if they >>> call >>> LOG API in multi threads? >>> >>[Lennart] The mutex handling should be cleaned up. >> - Recovery state should be protected by its own mutex so that APIs can work >>in respective state without locking. >> - Client data should be protected in all threads by the lga_cb lock. >> - The lga_recov_mutex can probably be removed >> >>> Regards, Vu. >>> >>> >>> >-----Original Message----- >>> >From: Lennart Lund [mailto:[email protected]] >>> >Sent: Monday, April 04, 2016 3:51 PM >>> >To: Vu Minh Nguyen >>> >Cc: [email protected]; [email protected]; >>> Lennart >>> >Lund >>> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common >resource >>> in >>> >log agent [#1705] >>> > >>> >Hi Vu, >>> > >>> >A simpler solution than described below, maybe: >>> > >>> >Still use a separate "recovery2" mutex but a "normal one". >>> > - In the API, lock the mutex just before checking if we are in recovery >>> > state >>> 2 >>> >(before "if (lga_state == LGA_RECOVERY2) {") >>> > - In the API, do not unlock until the API operation is finished >>> > - In the recovery thread lock the mutex just before changing the recovery >>> >state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2 >>> mutex) >>> >(protect the lga_state variable) >>> >- In the recovery thread the mutex shall be unlocked just after lga_state >>> >is >>> set >>> >to recovery state 2. >>> > >>> >This means that the recovery thread cannot set recovery state2 or execute >>> >until an eventual ongoing API operation is done. >>> >If the thread starts and set recovery state just before the API checks >>> recovery >>> >state the API will correctly see that we are in recovery state 2 and handle >>> thing >>> >accordingly. >>> > >>> >Regards >>> >Lennart >>> > >>> > >>> >> -----Original Message----- >>> >> From: Lennart Lund >>> >> Sent: den 1 april 2016 16:51 >>> >> To: Vu Minh Nguyen >>> >> Cc: [email protected]; [email protected]; >>> >> Lennart Lund >>> >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common >>> resource >>> >> in log agent [#1705] >>> >> >>> >> Hi Vu, >>> >> >>> >> You are right, good finding, this may happen since the recovery2 thread >is >>> >> started and is waiting for a timeout. If this timeout happen at the >>> >> "wrong >>> >> time" bad things may happen. >>> >> However the actual problem should be solved. >>> >> >>> >> This is how it could be done (maybe you can find a better way?): >>> >> When timeout the recovery thread shall set recovery state 2 as is but >shall >>> >> not start any recovery if there is any ongoing client activity. >>> >> This could be solved by using a "recovery2" mutex (not the existing >>> >> lga_cb.cb_lock). >>> >> - The recovery thread lock this mutex after setting recovery state 2. >>> >> - During recovery state 1 all APIs check if the mutex is taken and if so >>> return >>> >> TRY AGAIN (this means that we are in recovery state 2 but the state >>> change >>> >> may have happened after the check in the API). Note: "trylock" cannot be >>> >> used for this purpose since a sequence trylock -> lock is not atomic. >>> Maybe a >>> >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used. >>> >> - If the mutex is not taken it shall be locked by the API function and >>> >> be >>> >> unlocked just before the API function returns preventing the recovery >>> >> thread from starting recovery during an ongoing API operation. >>> >> The next time an API is called recovery state 2 is set and the API >>> >> function >>> will >>> >> operate accordingly. >>> >> >>> >> Thanks >>> >> Lennart >>> >> >>> >> > -----Original Message----- >>> >> > From: Vu Minh Nguyen [mailto:[email protected]] >>> >> > Sent: den 1 april 2016 16:02 >>> >> > To: Lennart Lund >>> >> > Cc: [email protected]; [email protected] >>> >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common >>> >> resource >>> >> > in log agent [#1705] >>> >> > >>> >> > Hi Lennart, >>> >> > >>> >> > Is it possible for following case happens? >>> >> > >>> >> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG >>> >> > service up, recovery thread starts and >>> >> > comes to the point right after "while (pclient != NULL)" in >>> >> > recovery2_thread @lga_state.c >>> >> > Then, the context is switching to the client thread and it deletes the >>> >> > resource while recovery2_thread is running? >>> >> > >>> >> > Regards, Vu. >>> >> > >>> >> > Quoting Lennart Lund <[email protected]>: >>> >> > >>> >> > > Hi Vu, >>> >> > > >>> >> > > /* Recover clients one at a time */ >>> >> > > rc = recover_one_client(p_client); >>> >> > > + if (p_client == NULL) goto done; >>> >> > > + >>> >> > > [Lennart] Why is this needed? This check is done in "while >>> >> > > (p_client != NULL) { " >>> >> > > and p_client cannot become NULL until "p_client = p_client->next; " >>> >> > > is done as >>> >> > > the last line in the while loop. >>> >> > > >>> >> > > [Vu] It is my intention. I am afraid if the context is switched to >>> >> > > client thread >>> >> > > right after the recovery thread enters the while loop. >>> >> > > If it is the case, p_client will be NULL after done >>> >> > > recover_one_client(). >>> >> > > >>> >> > > [Lennart] I don't really understand your explanation. What is >>> >> > > happening in the client thread when we are in recovery state 2 that >>> >> > > can set p_client == NULL? The client thread should not be able to do >>> >> > > anything with the data involved when recovering client in the >>> >> > > recovery thread? >>> >> > > >>> >> > > Regards >>> >> > > Lennart >>> >> > > >>> >> > >> -----Original Message----- >>> >> > >> From: Vu Minh Nguyen [mailto:[email protected]] >>> >> > >> Sent: den 1 april 2016 14:14 >>> >> > >> To: Lennart Lund >>> >> > >> Cc: [email protected]; [email protected] >>> >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common >>> >> > resource >>> >> > >> in log agent [#1705] >>> >> > >> >>> >> > >> Hi Lennart, >>> >> > >> >>> >> > >> Please see my comment inline. >>> >> > >> >>> >> > >> Regards, Vu >>> >> > >> >>> >> > >> Quoting Lennart Lund <[email protected]>: >>> >> > >> >>> >> > >> > Hi Vu >>> >> > >> > >>> >> > >> > Ack with comments >>> >> > >> > >>> >> > >> > See my comments inline [Lennart]: >>> >> > >> > >>> >> > >> > Thanks >>> >> > >> > Lennart >>> >> > >> > >>> >> > >> >> -----Original Message----- >>> >> > >> >> From: Vu Minh Nguyen [mailto:[email protected]] >>> >> > >> >> Sent: den 23 mars 2016 08:16 >>> >> > >> >> To: [email protected]; Lennart Lund >>> >> > >> >> Cc: [email protected] >>> >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common >>> >> > resource >>> >> > >> in >>> >> > >> >> log agent [#1705] >>> >> > >> >> >>> >> > >> >> osaf/libs/agents/saf/lga/lga_state.c | 41 >>> >> > >> >> +++++++++++++++++++++++++++++++---- >>> >> > >> >> osaf/libs/agents/saf/lga/lga_state.h | 1 + >>> >> > >> >> osaf/libs/agents/saf/lga/lga_util.c | 30 >>> >> > >> >> +++++++++++------------- >>> - >>> >> > >> >> 3 files changed, 50 insertions(+), 22 deletions(-) >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> There was an race condition between client thread and recovery >>> >> > thread >>> >> > >> >> accessing to common resource `client_list`. >>> >> > >> >> >>> >> > >> >> Add mutex protection. >>> >> > >> >> >>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c >>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c >>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c >>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c >>> >> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void >>> *dumm >>> >> > >> >> /* We have been signaled to exit >>> >> > >> >> */ >>> >> > >> >> goto done; >>> >> > >> >> } >>> >> > >> >> + >>> >> > >> >> /* Recover clients one at a time */ >>> >> > >> >> rc = recover_one_client(p_client); >>> >> > >> >> + if (p_client == NULL) goto done; >>> >> > >> >> + >>> >> > >> > [Lennart] Why is this needed? This check is done in "while >(p_client >>> >> > >> > != NULL) { " and p_client cannot become NULL until "p_client = >>> >> > >> > p_client->next; " is done as the last line in the while loop. >>> >> > >> > >>> >> > >> [Vu] It is my intention. I am afraid if the context is switched to >>> >> > >> client thread >>> >> > >> right after the recovery thread enters the while loop. >>> >> > >> If it is the case, p_client will be NULL after done >>> recover_one_client(). >>> >> > >> >>> >> > >> >> TRACE("\t Client %d is recovered", p_client- >>> >> > >> >> >lgs_client_id); >>> >> > >> >> if (rc == -1) { >>> >> > >> >> TRACE("%s recover_one_client >>> >> > >> >> Fail Deleting cllient (id %d)", >>> >> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re >>> >> > >> >> >>> >> > >> >> TRACE_ENTER(); >>> >> > >> >> >>> >> > >> >> + if (p_client == NULL) { >>> >> > >> >> + /** >>> >> > >> >> + * Probably the client_list has been deleted by >>> >> > >> >> client thread >>> >> > >> >> + * in context calling saLogFinalize(). So, give >>> >> > >> >> up >>> >> > >> >> recovering. >>> >> > >> >> + */ >>> >> > >> >> + rc = -1; >>> >> > >> >> + goto done; >>> >> > >> >> + } >>> >> > >> >> + >>> >> > >> >> /* This function may be called both in the recovery >>> >> > >> >> thread and >>> >> > >> >> in the >>> >> > >> >> * client thread. A possible scenario is that the >>> >> > >> >> recovery 2 >>> >> > >> >> thread >>> >> > >> >> * times out and calls this function in recovery mode 2 >>> >> > >> >> while >>> >> > >> >> there >>> >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client( >>> >> > >> >> lga_client_hdl_rec_t *rm_node >>> >> > >> >> ) >>> >> > >> >> { >>> >> > >> >> - TRACE_ENTER2(); >>> >> > >> >> - uint32_t ncs_rc; >>> >> > >> >> + return lga_hdl_rec_del(list_head, rm_node); >>> >> > >> >> +} >>> >> > >> >> >>> >> > >> >> +/** >>> >> > >> >> + * Free the memory allocated for one client handle with mutext >>> >> > >> protection. >>> >> > >> >> + * >>> >> > >> >> + * @param p_client_hdl >>> >> > >> >> + * @return void >>> >> > >> >> + */ >>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl) >>> >> > >> >> +{ >>> >> > >> >> + lga_client_hdl_rec_t *client_hdl; >>> >> > >> >> + >>> >> > >> >> + /** >>> >> > >> >> + * To prevent race condition b/w client thread and >>> >> > >> >> recovery >>> >> > >> >> thread >>> >> > >> >> + * accessing to common resource `client_list`. [#1705] >>> >> > >> >> + */ >>> >> > >> >> osaf_mutex_lock_ordie(&lga_recov_mutex); >>> >> > >> >> - ncs_rc = lga_hdl_rec_del(list_head, rm_node); >>> >> > >> >> + >>> >> > >> >> + client_hdl = *p_client_hdl; >>> >> > >> >> + if (client_hdl == NULL) goto done; >>> >> > >> >> + >>> >> > >> >> + free(client_hdl); >>> >> > >> >> + client_hdl = NULL; >>> >> > >> >> + >>> >> > >> >> +done: >>> >> > >> >> osaf_mutex_unlock_ordie(&lga_recov_mutex); >>> >> > >> >> >>> >> > >> >> - TRACE_LEAVE(); >>> >> > >> >> - return ncs_rc; >>> >> > >> >> } >>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h >>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h >>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h >>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h >>> >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client( >>> >> > >> >> lga_client_hdl_rec_t **list_head, >>> >> > >> >> lga_client_hdl_rec_t *rm_node >>> >> > >> >> ); >>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl); >>> >> > >> >> >>> >> > >> >> #ifdef __cplusplus >>> >> > >> >> } >>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c >>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c >>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c >>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c >>> >> > >> >> @@ -19,6 +19,7 @@ >>> >> > >> >> #include <syslog.h> >>> >> > >> >> #include "lga.h" >>> >> > >> >> #include "osaf_poll.h" >>> >> > >> >> +#include "lga_state.h" >>> >> > >> >> >>> >> > >> >> /* Variables used during startup/shutdown only */ >>> >> > >> >> static pthread_mutex_t lga_lock = >>> PTHREAD_MUTEX_INITIALIZER; >>> >> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec >>> >> > >> >> /** clean up the channel records for this lga-client >>> >> > >> >> **/ >>> >> > >> >> lga_log_stream_hdl_rec_list_del(&client_hdl- >>> >> > >> >> >stream_list); >>> >> > >> >> - /** remove the association with hdl-mngr >>> >> > >> >> - **/ >>> >> > >> >> - free(client_hdl); >>> >> > >> >> - client_hdl = 0; >>> >> > >> >> + free_client_hdl(&client_hdl); >>> >> > >> >> } >>> >> > >> >> + >>> >> > >> >> TRACE_LEAVE(); >>> >> > >> >> } >>> >> > >> >> >>> >> > >> >> @@ -536,38 +535,35 @@ uint32_t >>> lga_hdl_rec_del(lga_client_hdl_ >>> >> > >> >> if (list_iter == rm_node) { >>> >> > >> >> *list_head = rm_node->next; >>> >> > >> >> >>> >> > >> >> - /** detach & release the IPC >>> >> > >> >> - **/ >>> >> > >> >> + /* detach & release the IPC */ >>> >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, >>> >> > >> >> lga_clear_mbx, NULL); >>> >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); >>> >> > >> >> >>> >> > >> >> ncshm_give_hdl(rm_node->local_hdl); >>> >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, >>> >> > >> >> rm_node->local_hdl); >>> >> > >> >> - /** Free the channel records off this hdl >>> >> > >> >> - **/ >>> >> > >> >> + >>> >> > >> >> + /* Free the channel records off this hdl */ >>> >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node- >>> >> > >> >> >stream_list); >>> >> > >> >> + free_client_hdl(&rm_node); >>> >> > >> >> >>> >> > >> >> - /** free the hdl rec >>> >> > >> >> - **/ >>> >> > >> >> - free(rm_node); >>> >> > >> >> rc = NCSCC_RC_SUCCESS; >>> >> > >> >> goto out; >>> >> > >> >> - } else { /* find the rec */ >>> >> > >> >> + } else { >>> >> > >> >> + /* find the rec */ >>> >> > >> >> while (NULL != list_iter) { >>> >> > >> >> if (list_iter->next == rm_node) { >>> >> > >> >> list_iter->next = >>> >> > >> >> rm_node->next; >>> >> > >> >> >>> >> > >> >> - /** detach & release the IPC */ >>> >> > >> >> + /* detach & release >>> >> > >> >> the IPC */ >>> >> > >> >> >>> >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx, >>> >> > >> NULL); >>> >> > >> >> >>> >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> ncshm_give_hdl(rm_node->local_hdl); >>> >> > >> >> >>> >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node- >>> >> > >> >> >local_hdl); >>> >> > >> >> - /** Free the channel records off this lga_hdl >>> >> > >> >> */ >>> >> > >> >> + >>> >> > >> >> + /* Free the channel >>> >> > >> >> records off this lga_hdl */ >>> >> > >> >> >>> >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node->stream_list); >>> >> > >> >> - >>> >> > >> >> - /** free the hdl rec */ >>> >> > >> >> - free(rm_node); >>> >> > >> >> + >>> >> > >> >> free_client_hdl(&rm_node); >>> >> > >> >> >>> >> > >> >> rc = >>> >> > >> >> NCSCC_RC_SUCCESS; >>> >> > >> >> goto out; >>> >> > >> >>> >> > >>> ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
