Hi Vu, Ack with a minor comment
- The _e suffix for typedef is not used in OpenSAF should be _t e.g. lgs_state_e -> lgs_state_t Thanks Lennart > -----Original Message----- > From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > Sent: den 20 april 2016 12:16 > To: Lennart Lund; mathi.naic...@oracle.com > Cc: opensaf-devel@lists.sourceforge.net > Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource > in log agent [#1705] > > Hi Lennart, > > Thanks. I have updated some code in your patch to fix > problems/enhancement: > > 1) Recursively lock mutex `cb_lock` in recovery & client thread > eg: lock in `lga_recover_one_client()` -> lock in > `recover_one_stream()`/`initialize_one_client()`. > > 2) Race condition b/w client thread & mds thread. > e.g: > - `lga_recover_one_client()` vs `lga_no_server_state_set()` > - Free client list vs `lga_no_server_state_set()` > > 3) Make the recovery thread joinable & adding pthread_exit(). > > 4) Using osaf function osaf_mutex_lock_ordie/osaf_mutex_unlock_ordie > for lock/unlock mutex instead of using pthread_mutex_lock/unlock. > > 5) Introduce set/check functions with thread safe for lga_state/lgs_state. > > Please have a look and give your comments. Thanks. > > Note: > I ran test several times. > > Regards, Vu. > > >-----Original Message----- > >From: Lennart Lund [mailto:lennart.l...@ericsson.com] > >Sent: Tuesday, April 19, 2016 6:13 PM > >To: Vu Minh Nguyen; mathi.naic...@oracle.com > >Cc: opensaf-devel@lists.sourceforge.net > >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource > in > >log agent [#1705] > > > >Hi Vu > > > >I am not surprised that you have found problems in my patch. It was > created in > >half an hour and not very much tested. > >It is mostly an idea of a strategy to handle the problems. I have refined it > >a > bit > >hopefully solving the problems you have found so far. > >Attached is the refined version. > >The last word on how to fix this is however yours. > > > >See my comments inline [Lennart] > > > >Thanks > >Lennart > > > >> -----Original Message----- > >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > >> Sent: den 19 april 2016 09:56 > >> To: Lennart Lund; mathi.naic...@oracle.com > >> Cc: opensaf-devel@lists.sourceforge.net > >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common > resource > >> in log agent [#1705] > >> > >> Hi Lennart, > >> > >> I tried to run the test on 2 nodes cluster (SC-1 & PL-3): sending a record > using > >> `saflogger` in loop on PL-3, then reboot SC-1 frequently. > >> And I got following error message: > >> > >> # > >> *** Error in `/usr/local/bin/saflogger': munmap_chunk(): invalid pointer: > >> 0x00000000408d5cf8 *** > >> Aborted (core dumped) > >> # > >> > >> I read code and see following points: > >> > >> 1) Static data `is_locked` and ` recovery2_lock(void)/ recovery2_unlock()` > is > >> not thread safe. > >> Use case: Calling LOG API with different log handles in multi threads. > >> > >[Lennart] You are right. The problem is the is_locked variable. > >If the is_locked variable is a stack variable in the calling function and > >given to > >recovery2_lock() and recovery2_unlock() as an in parameter this should be > >thread safe. > > > > > >> 2) Conflict using resource `client_list` b/w recovery thread and MDS > thread. > >> Use case: recovery2 thread is executing (after timeout), headless occurs. > >> Then, mds thread will clear all flags while the recovery2 thread is > recovering > >> clients. > >> > >[Lennart] Yes, the thread must be terminated before handling the flags (or > any > >other client data). Can be solved by joining the thread after ordering it to > >terminate. > > > >> 3) Conflict using resource `client_list` b/w client thread and MDS thread. > >> Use case: client thread is recovering client, calling > `lga_recover_one_client`, > >> then headless occurs. > >> > >[Lennart] Yes, lga_recover_one_client() must be made thread safe > regarding > >the client data. Here the lga_cb lock should be used. > > > >> I think we can use only one mutex `cb_lock` to synchronize accessing > >> to common resource `lga_cb` among threads (client/recovery/mds). > >> And to simplify the code, take the lgs_state out of the lga_cb > >> same as lga_state, using set/get on `private` mutexes? > >> > >> > >[Lennart] The main thread problem here is to make handling of the global > >lga_cb structure including the client "database" thread safe but this is not > the > >only problem. Also API handling client requests and the recovery thread > must > >be synchronized and to use the lga_cb lock for this as well as for protecting > the > >cb data makes things complicated and error prone. > >Also the API must be responsive also while recovery is ongoing. > >The lgs_state has nothing to do with resilience and has been there from > before > >resilience was introduced (I think it had another name). > > > >> Regards, Vu. > >> > >> >-----Original Message----- > >> >From: Lennart Lund [mailto:lennart.l...@ericsson.com] > >> >Sent: Friday, April 15, 2016 4:01 PM > >> >To: Vu Minh Nguyen; mathi.naic...@oracle.com > >> >Cc: opensaf-devel@lists.sourceforge.net; Lennart Lund > >> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common > resource > >> in > >> >log agent [#1705] > >> > > >> >Hi Vu, > >> > > >> >When reviewing this I found several problems. Instead of trying to > >> comment > >> >everything I have instead created a patch of my own. > >> > > >> >The main problem fixed by this patch is the handling of the critical > section > >> >where both the recovery2 thread and the client thread could handle the > >> same > >> >global client data in a conflicting way. > >> >The problem was that the recovery 2 thread could start while an API was > in > >> a > >> >section handling client data. > >> >This solution allows the APIs to return without waiting for the recovery 2 > >> >thread to finish. > >> > > >> >In the agent there are 3 mutexes: > >> >1. The cb_lock (legacy) > >> >This mutex is intended for protecting write/read of the global variables > in > >> the > >> >lga_cb structure. > >> >Shall be locked only when changing or reading data in the variables not > for > >> >locking other critical sections. > >> > > >> >2. The lga_lock (legacy) > >> >Local to lga_util.c > >> >Used to protect critical sections when handling some critical sections in > lga > >> >startup and shut down > >> > > >> >3. The lga_state_lock > >> >Local in lga_state.c > >> >Used for protecting the lga state. The lga_state variable is removed from > >> the > >> >lga_cb structure and is used locally in lga_state.c > >> >The lga_state is handled in the mds thread and in the recovery 2 thread. > >> > > >> >3. The lga_recov2_lock (new) > >> >Used to protect critical sections so that client APIs and the recovery 2 > >thread > >> >does not try to handle > >> >client data in a conflicting way. > >> > > >> >4. The lga_recov_mutex > >> >Is removed and replaced by the lga_recov2_lock > >> > > >> >NOTE1: > >> >The lga_state and the lgs_state that was broken out of the lga_cb > structure > >> >(but still were global) in your patch is not realted. > >> >The lgs_state is legacy but had another name before resilience > functionality > >> >was introduced. I have not moved lgs_state out of the lgs_cb structure. > >> >The lga_state is removed from the lga_cb structure and is handled > entirely > >> via > >> >lga_state functions > >> > > >> >NOTE2: > >> >I have run logtest with and without resilience activated and have run > both > >> >legacy and resilience test cases in UML cluster. > >> >I have not tested with helgrind. > >> > > >> >Please check my patch carefully and try to find out if there are any > >> problems. > >> >Also please run resilience tests with valgrind/helgrind > >> > > >> >Regards > >> >Lennart > >> > > >> >> -----Original Message----- > >> >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > >> >> Sent: den 8 april 2016 12:04 > >> >> To: Lennart Lund; mathi.naic...@oracle.com > >> >> Cc: opensaf-devel@lists.sourceforge.net > >> >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common > >> resource > >> >> in log agent [#1705] > >> >> > >> >> Hi all, > >> >> > >> >> Have you had time to look at this? Thanks. > >> >> > >> >> Regards, Vu. > >> >> > >> >> > >> >> >-----Original Message----- > >> >> >From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > >> >> >Sent: Wednesday, April 06, 2016 12:44 PM > >> >> >To: 'Lennart Lund' > >> >> >Cc: 'mathi.naic...@oracle.com'; 'opensaf- > de...@lists.sourceforge.net' > >> >> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common > >> resource > >> >> in > >> >> >log agent [#1705] > >> >> > > >> >> >Hi Lennart, > >> >> > > >> >> >Thanks for your comments. > >> >> > > >> >> >I think it is good idea to pick the lgs_state & lga_state out of > >> >> >lga_cb, > >> >> >and use the mutex cb_lock to protect changes in lga control block. > >> >> > > >> >> >I reflect these things in attached file. > >> >> > > >> >> >And to make the code a bit clean, I have defined set/get functions for > >> >> lgs_state > >> >> >& lga_state. > >> >> >Also, I declare the mutex cb_lock as recursive one to avoid possible > >> >> recursively > >> >> >lock in one thread. > >> >> > > >> >> >Could you have a look and give your comments? > >> >> > > >> >> >Regards, Vu. > >> >> > > >> >> >>-----Original Message----- > >> >> >>From: Lennart Lund [mailto:lennart.l...@ericsson.com] > >> >> >>Sent: Tuesday, April 05, 2016 6:59 PM > >> >> >>To: Vu Minh Nguyen > >> >> >>Cc: mathi.naic...@oracle.com; opensaf- > de...@lists.sourceforge.net; > >> >> Lennart > >> >> >>Lund > >> >> >>Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common > >> >> resource in > >> >> >>log agent [#1705] > >> >> >> > >> >> >>Hi Vu > >> >> >> > >> >> >>See my comments inline [Lennart] > >> >> >> > >> >> >>Note. My comments does not give the complete solution. For that I > >> need > >> >> to > >> >> >>use more time but may give you some ideas. > >> >> >> > >> >> >>/Lennart > >> >> >> > >> >> >>> -----Original Message----- > >> >> >>> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au] > >> >> >>> Sent: den 5 april 2016 10:48 > >> >> >>> To: Lennart Lund > >> >> >>> Cc: mathi.naic...@oracle.com; opensaf- > de...@lists.sourceforge.net > >> >> >>> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common > >> >> resource > >> >> >>> in log agent [#1705] > >> >> >>> > >> >> >>> Hi Lennart, > >> >> >>> > >> >> >>> Thanks for your proposal. > >> >> >>> > >> >> >>> Still, there is race condition b/w mds thread and recover/client > >> thread, I > >> >> >>> think. > >> >> >>> 1) Headless while recover thread 2 is running > >> >> >>> 2) Headless while the client thread is accessing/modifying to client > >> list. > >> >> >>> > >> >> >>[Lennart] In the mds thread the lga_no_server_state_set() is called > if > >> >> server > >> >> >>down is detected. If recovery state is LGA_RECOVERY2 the thread is > >> >> stopped > >> >> >>and recovery state is set to state LGA_NO_SERVER. This means that > we > >> >> start > >> >> >all > >> >> >>over again with recovery state LGA_RECOVERY1 when the server > >> comes > >> >> back > >> >> >>again. > >> >> >>Within the thread the cb lock is taken while the client list is > >> >> >>deleted if > >> >> >needed. > >> >> >>With a new mutex for lga_state this mutex should also be used in all > >> other > >> >> >>places where lga_state is handled not the cb_lock. This means that > the > >> >> name > >> >> >>"recovery2" mutex is not so good instead there should be a > lga_state > >> >> mutex. > >> >> >>The lga_state variable may also be moved outside the lga_cb > structure. > >> >> >> > >> >> >>> I would like to propose other one, using only one mutex cb_lock > >> >> @lga_cb > >> >> >>> to protect common `lga_cb` resource, including `state`, `client` > >> >> >>> list, > >> etc. in > >> >> >>> three threads. > >> >> >>> > >> >> >>[Lennart] The cb_lock is used to protect variables in the cb structure > >> but > >> >> the > >> >> >>lga_state mutex prevent recovery or API operations from being > done > >> at > >> >> the > >> >> >>same time. > >> >> >> > >> >> >>> 1) Client thread > >> >> >>> - Lock mutex before getting the state > >> >> >>> - Unlock right before the API operation is finished > >> >> >>> > >> >> >>[Lennart] Yes but the lga_state mutex > >> >> >> > >> >> >>> 2) Recovery2 thread > >> >> >>> - Lock mutex after timeout > >> >> >>> - Unlock before ending the thread > >> >> >>> > >> >> >>[Lennart] No this will cause the API to be locked as long as the > recovery > >> 2 > >> >> >>thread is running and this is not wanted. The API shall work also > while > >> the > >> >> >>recovery thread is running and most of the APIs shall return with > >> >> >>SA_AIS_TRY_AGAIN during recovery 2 state. > >> >> >> > >> >> >>> 3) MDS thread > >> >> >>> - Lock/unlock before/after changing state > >> >> >>> In case of headless: > >> >> >>> - Lock/unlock mutex before modifying `client_list` in > >> >> >>> lga_no_server_state_set() > >> >> >>> > >> >> >>[Lennart] lga_state is not changed directly in the mds thread but > >> indirectly > >> >> >>when the lga_serv_recov1state_set() and > lga_no_server_state_set() > >> >> functions > >> >> >>are called. I these functions the recovery thread and state change is > >> >> handled > >> >> >>and the mutex used shall be changed to the lga_state mutex. > >> >> >> > >> >> >>The client data should be protected by the cb_lock mutex in all > places > >> >> which > >> >> >is > >> >> >>not the case. There is no protection in the mds trhead in the > >> >> >>lga_lgs_msg_proc() where client data is used via the > >> >> >>ga_find_hdl_rec_by_regid() function. This function should lock the > >> lga_cb > >> >> lock > >> >> >>while handling the lga_cb client data. > >> >> >>Also there are two functions in lga_state.c that are handling client > data > >> >> >locking > >> >> >>a lga_recov_mutex. This does not protect client data handled in the > >> mds > >> >> >>thread. > >> >> >> > >> >> >>> What is your opinion? Do you think there is impact to log client if > they > >> >call > >> >> >>> LOG API in multi threads? > >> >> >>> > >> >> >>[Lennart] The mutex handling should be cleaned up. > >> >> >> - Recovery state should be protected by its own mutex so that APIs > >> can > >> >> work > >> >> >>in respective state without locking. > >> >> >> - Client data should be protected in all threads by the lga_cb lock. > >> >> >> - The lga_recov_mutex can probably be removed > >> >> >> > >> >> >>> Regards, Vu. > >> >> >>> > >> >> >>> > >> >> >>> >-----Original Message----- > >> >> >>> >From: Lennart Lund [mailto:lennart.l...@ericsson.com] > >> >> >>> >Sent: Monday, April 04, 2016 3:51 PM > >> >> >>> >To: Vu Minh Nguyen > >> >> >>> >Cc: mathi.naic...@oracle.com; opensaf- > >> de...@lists.sourceforge.net; > >> >> >>> Lennart > >> >> >>> >Lund > >> >> >>> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for > common > >> >> >resource > >> >> >>> in > >> >> >>> >log agent [#1705] > >> >> >>> > > >> >> >>> >Hi Vu, > >> >> >>> > > >> >> >>> >A simpler solution than described below, maybe: > >> >> >>> > > >> >> >>> >Still use a separate "recovery2" mutex but a "normal one". > >> >> >>> > - In the API, lock the mutex just before checking if we are in > >> recovery > >> >> state > >> >> >>> 2 > >> >> >>> >(before "if (lga_state == LGA_RECOVERY2) {") > >> >> >>> > - In the API, do not unlock until the API operation is finished > >> >> >>> > - In the recovery thread lock the mutex just before changing the > >> >> recovery > >> >> >>> >state to recovery state 2 (replace lock of lga_cb.cb_lock with > >> recovery2 > >> >> >>> mutex) > >> >> >>> >(protect the lga_state variable) > >> >> >>> >- In the recovery thread the mutex shall be unlocked just after > >> >> lga_state is > >> >> >>> set > >> >> >>> >to recovery state 2. > >> >> >>> > > >> >> >>> >This means that the recovery thread cannot set recovery state2 > or > >> >> execute > >> >> >>> >until an eventual ongoing API operation is done. > >> >> >>> >If the thread starts and set recovery state just before the API > checks > >> >> >>> recovery > >> >> >>> >state the API will correctly see that we are in recovery state 2 > >> >> >>> >and > >> >> handle > >> >> >>> thing > >> >> >>> >accordingly. > >> >> >>> > > >> >> >>> >Regards > >> >> >>> >Lennart > >> >> >>> > > >> >> >>> > > >> >> >>> >> -----Original Message----- > >> >> >>> >> From: Lennart Lund > >> >> >>> >> Sent: den 1 april 2016 16:51 > >> >> >>> >> To: Vu Minh Nguyen > >> >> >>> >> Cc: mathi.naic...@oracle.com; opensaf- > >> >> de...@lists.sourceforge.net; > >> >> >>> >> Lennart Lund > >> >> >>> >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for > common > >> >> >>> resource > >> >> >>> >> in log agent [#1705] > >> >> >>> >> > >> >> >>> >> Hi Vu, > >> >> >>> >> > >> >> >>> >> You are right, good finding, this may happen since the > recovery2 > >> >> thread > >> >> >is > >> >> >>> >> started and is waiting for a timeout. If this timeout happen at > the > >> >> "wrong > >> >> >>> >> time" bad things may happen. > >> >> >>> >> However the actual problem should be solved. > >> >> >>> >> > >> >> >>> >> This is how it could be done (maybe you can find a better > way?): > >> >> >>> >> When timeout the recovery thread shall set recovery state 2 as > is > >> but > >> >> >shall > >> >> >>> >> not start any recovery if there is any ongoing client activity. > >> >> >>> >> This could be solved by using a "recovery2" mutex (not the > >> existing > >> >> >>> >> lga_cb.cb_lock). > >> >> >>> >> - The recovery thread lock this mutex after setting recovery > state > >> 2. > >> >> >>> >> - During recovery state 1 all APIs check if the mutex is taken > and if > >> so > >> >> >>> return > >> >> >>> >> TRY AGAIN (this means that we are in recovery state 2 but the > >> state > >> >> >>> change > >> >> >>> >> may have happened after the check in the API). Note: "trylock" > >> >> cannot be > >> >> >>> >> used for this purpose since a sequence trylock -> lock is not > >> atomic. > >> >> >>> Maybe a > >> >> >>> >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used. > >> >> >>> >> - If the mutex is not taken it shall be locked by the API > >> >> >>> >> function > >> and > >> >> be > >> >> >>> >> unlocked just before the API function returns preventing the > >> >> recovery > >> >> >>> >> thread from starting recovery during an ongoing API operation. > >> >> >>> >> The next time an API is called recovery state 2 is set and the > >> >> >>> >> API > >> >> function > >> >> >>> will > >> >> >>> >> operate accordingly. > >> >> >>> >> > >> >> >>> >> Thanks > >> >> >>> >> Lennart > >> >> >>> >> > >> >> >>> >> > -----Original Message----- > >> >> >>> >> > From: Vu Minh Nguyen > [mailto:vu.m.ngu...@dektech.com.au] > >> >> >>> >> > Sent: den 1 april 2016 16:02 > >> >> >>> >> > To: Lennart Lund > >> >> >>> >> > Cc: mathi.naic...@oracle.com; opensaf- > >> >> de...@lists.sourceforge.net > >> >> >>> >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for > >> common > >> >> >>> >> resource > >> >> >>> >> > in log agent [#1705] > >> >> >>> >> > > >> >> >>> >> > Hi Lennart, > >> >> >>> >> > > >> >> >>> >> > Is it possible for following case happens? > >> >> >>> >> > > >> >> >>> >> > Client thread comes to point send_Finalize_msg() @lga_api, > >> then > >> >> LOG > >> >> >>> >> > service up, recovery thread starts and > >> >> >>> >> > comes to the point right after "while (pclient != NULL)" in > >> >> >>> >> > recovery2_thread @lga_state.c > >> >> >>> >> > Then, the context is switching to the client thread and it > deletes > >> the > >> >> >>> >> > resource while recovery2_thread is running? > >> >> >>> >> > > >> >> >>> >> > Regards, Vu. > >> >> >>> >> > > >> >> >>> >> > Quoting Lennart Lund <lennart.l...@ericsson.com>: > >> >> >>> >> > > >> >> >>> >> > > Hi Vu, > >> >> >>> >> > > > >> >> >>> >> > > /* Recover clients one at a time */ > >> >> >>> >> > > rc = recover_one_client(p_client); > >> >> >>> >> > > + if (p_client == NULL) goto done; > >> >> >>> >> > > + > >> >> >>> >> > > [Lennart] Why is this needed? This check is done in "while > >> >> >>> >> > > (p_client != NULL) { " > >> >> >>> >> > > and p_client cannot become NULL until "p_client = p_client- > >> >> >next; " > >> >> >>> >> > > is done as > >> >> >>> >> > > the last line in the while loop. > >> >> >>> >> > > > >> >> >>> >> > > [Vu] It is my intention. I am afraid if the context is > >> >> >>> >> > > switched > to > >> >> >>> >> > > client thread > >> >> >>> >> > > right after the recovery thread enters the while loop. > >> >> >>> >> > > If it is the case, p_client will be NULL after done > >> >> recover_one_client(). > >> >> >>> >> > > > >> >> >>> >> > > [Lennart] I don't really understand your explanation. What > is > >> >> >>> >> > > happening in the client thread when we are in recovery > state 2 > >> >> that > >> >> >>> >> > > can set p_client == NULL? The client thread should not be > able > >> to > >> >> do > >> >> >>> >> > > anything with the data involved when recovering client in > the > >> >> >>> >> > > recovery thread? > >> >> >>> >> > > > >> >> >>> >> > > Regards > >> >> >>> >> > > Lennart > >> >> >>> >> > > > >> >> >>> >> > >> -----Original Message----- > >> >> >>> >> > >> From: Vu Minh Nguyen > >> [mailto:vu.m.ngu...@dektech.com.au] > >> >> >>> >> > >> Sent: den 1 april 2016 14:14 > >> >> >>> >> > >> To: Lennart Lund > >> >> >>> >> > >> Cc: mathi.naic...@oracle.com; opensaf- > >> >> de...@lists.sourceforge.net > >> >> >>> >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for > >> >> common > >> >> >>> >> > resource > >> >> >>> >> > >> in log agent [#1705] > >> >> >>> >> > >> > >> >> >>> >> > >> Hi Lennart, > >> >> >>> >> > >> > >> >> >>> >> > >> Please see my comment inline. > >> >> >>> >> > >> > >> >> >>> >> > >> Regards, Vu > >> >> >>> >> > >> > >> >> >>> >> > >> Quoting Lennart Lund <lennart.l...@ericsson.com>: > >> >> >>> >> > >> > >> >> >>> >> > >> > Hi Vu > >> >> >>> >> > >> > > >> >> >>> >> > >> > Ack with comments > >> >> >>> >> > >> > > >> >> >>> >> > >> > See my comments inline [Lennart]: > >> >> >>> >> > >> > > >> >> >>> >> > >> > Thanks > >> >> >>> >> > >> > Lennart > >> >> >>> >> > >> > > >> >> >>> >> > >> >> -----Original Message----- > >> >> >>> >> > >> >> From: Vu Minh Nguyen > >> >> [mailto:vu.m.ngu...@dektech.com.au] > >> >> >>> >> > >> >> Sent: den 23 mars 2016 08:16 > >> >> >>> >> > >> >> To: mathi.naic...@oracle.com; Lennart Lund > >> >> >>> >> > >> >> Cc: opensaf-devel@lists.sourceforge.net > >> >> >>> >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for > >> >> common > >> >> >>> >> > resource > >> >> >>> >> > >> in > >> >> >>> >> > >> >> log agent [#1705] > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> osaf/libs/agents/saf/lga/lga_state.c | 41 > >> >> >>> >> > >> >> +++++++++++++++++++++++++++++++---- > >> >> >>> >> > >> >> osaf/libs/agents/saf/lga/lga_state.h | 1 + > >> >> >>> >> > >> >> osaf/libs/agents/saf/lga/lga_util.c | 30 > +++++++++++---- > >> ---- > >> >> ----- > >> >> >>> - > >> >> >>> >> > >> >> 3 files changed, 50 insertions(+), 22 deletions(-) > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> There was an race condition between client thread and > >> >> recovery > >> >> >>> >> > thread > >> >> >>> >> > >> >> accessing to common resource `client_list`. > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> Add mutex protection. > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c > >> >> >>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c > >> >> >>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c > >> >> >>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c > >> >> >>> >> > >> >> @@ -355,8 +355,11 @@ static void > >> *recovery2_thread(void > >> >> >>> *dumm > >> >> >>> >> > >> >> /* We have been > >> >> signaled to exit > >> >> >>> >> > >> >> */ > >> >> >>> >> > >> >> goto done; > >> >> >>> >> > >> >> } > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> /* Recover clients one at a time */ > >> >> >>> >> > >> >> rc = recover_one_client(p_client); > >> >> >>> >> > >> >> + if (p_client == NULL) goto done; > >> >> >>> >> > >> >> + > >> >> >>> >> > >> > [Lennart] Why is this needed? This check is done in > "while > >> >> >(p_client > >> >> >>> >> > >> > != NULL) { " and p_client cannot become NULL until > >> "p_client = > >> >> >>> >> > >> > p_client->next; " is done as the last line in the while > loop. > >> >> >>> >> > >> > > >> >> >>> >> > >> [Vu] It is my intention. I am afraid if the context is > switched to > >> >> >>> >> > >> client thread > >> >> >>> >> > >> right after the recovery thread enters the while loop. > >> >> >>> >> > >> If it is the case, p_client will be NULL after done > >> >> >>> recover_one_client(). > >> >> >>> >> > >> > >> >> >>> >> > >> >> TRACE("\t Client %d is recovered", > >> >> p_client- > >> >> >>> >> > >> >> >lgs_client_id); > >> >> >>> >> > >> >> if (rc == -1) { > >> >> >>> >> > >> >> TRACE("%s > >> >> recover_one_client > >> >> >>> >> > >> >> Fail Deleting cllient (id %d)", > >> >> >>> >> > >> >> @@ -593,6 +596,15 @@ int > >> >> recover_one_client(lga_client_hdl_re > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> TRACE_ENTER(); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> + if (p_client == NULL) { > >> >> >>> >> > >> >> + /** > >> >> >>> >> > >> >> + * Probably the client_list has > >> >> been deleted by > >> >> >>> >> > >> >> client thread > >> >> >>> >> > >> >> + * in context calling > >> >> saLogFinalize(). So, give up > >> >> >>> >> > >> >> recovering. > >> >> >>> >> > >> >> + */ > >> >> >>> >> > >> >> + rc = -1; > >> >> >>> >> > >> >> + goto done; > >> >> >>> >> > >> >> + } > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> /* This function may be called both in the > >> >> recovery thread and > >> >> >>> >> > >> >> in the > >> >> >>> >> > >> >> * client thread. A possible scenario is that > >> >> >>> >> > >> >> the > >> >> recovery 2 > >> >> >>> >> > >> >> thread > >> >> >>> >> > >> >> * times out and calls this function in recovery > >> >> mode 2 while > >> >> >>> >> > >> >> there > >> >> >>> >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client( > >> >> >>> >> > >> >> lga_client_hdl_rec_t *rm_node > >> >> >>> >> > >> >> ) > >> >> >>> >> > >> >> { > >> >> >>> >> > >> >> - TRACE_ENTER2(); > >> >> >>> >> > >> >> - uint32_t ncs_rc; > >> >> >>> >> > >> >> + return lga_hdl_rec_del(list_head, rm_node); > >> >> >>> >> > >> >> +} > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> +/** > >> >> >>> >> > >> >> + * Free the memory allocated for one client handle > with > >> >> mutext > >> >> >>> >> > >> protection. > >> >> >>> >> > >> >> + * > >> >> >>> >> > >> >> + * @param p_client_hdl > >> >> >>> >> > >> >> + * @return void > >> >> >>> >> > >> >> + */ > >> >> >>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t > >> **p_client_hdl) > >> >> >>> >> > >> >> +{ > >> >> >>> >> > >> >> + lga_client_hdl_rec_t *client_hdl; > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> + /** > >> >> >>> >> > >> >> + * To prevent race condition b/w client thread > >> >> and recovery > >> >> >>> >> > >> >> thread > >> >> >>> >> > >> >> + * accessing to common resource `client_list`. > >> >> [#1705] > >> >> >>> >> > >> >> + */ > >> >> >>> >> > >> >> osaf_mutex_lock_ordie(&lga_recov_mutex); > >> >> >>> >> > >> >> - ncs_rc = lga_hdl_rec_del(list_head, rm_node); > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> + client_hdl = *p_client_hdl; > >> >> >>> >> > >> >> + if (client_hdl == NULL) goto done; > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> + free(client_hdl); > >> >> >>> >> > >> >> + client_hdl = NULL; > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> +done: > >> >> >>> >> > >> >> osaf_mutex_unlock_ordie(&lga_recov_mutex); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> - TRACE_LEAVE(); > >> >> >>> >> > >> >> - return ncs_rc; > >> >> >>> >> > >> >> } > >> >> >>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h > >> >> >>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h > >> >> >>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h > >> >> >>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h > >> >> >>> >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client( > >> >> >>> >> > >> >> lga_client_hdl_rec_t **list_head, > >> >> >>> >> > >> >> lga_client_hdl_rec_t *rm_node > >> >> >>> >> > >> >> ); > >> >> >>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t > >> **p_client_hdl); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> #ifdef __cplusplus > >> >> >>> >> > >> >> } > >> >> >>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c > >> >> >>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c > >> >> >>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c > >> >> >>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c > >> >> >>> >> > >> >> @@ -19,6 +19,7 @@ > >> >> >>> >> > >> >> #include <syslog.h> > >> >> >>> >> > >> >> #include "lga.h" > >> >> >>> >> > >> >> #include "osaf_poll.h" > >> >> >>> >> > >> >> +#include "lga_state.h" > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> /* Variables used during startup/shutdown only */ > >> >> >>> >> > >> >> static pthread_mutex_t lga_lock = > >> >> >>> PTHREAD_MUTEX_INITIALIZER; > >> >> >>> >> > >> >> @@ -449,11 +450,9 @@ void > >> >> lga_hdl_list_del(lga_client_hdl_rec > >> >> >>> >> > >> >> /** clean up the channel records for this lga- > >> >> client > >> >> >>> >> > >> >> **/ > >> >> >>> >> > >> >> > >> >> lga_log_stream_hdl_rec_list_del(&client_hdl- > >> >> >>> >> > >> >> >stream_list); > >> >> >>> >> > >> >> - /** remove the association with hdl-mngr > >> >> >>> >> > >> >> - **/ > >> >> >>> >> > >> >> - free(client_hdl); > >> >> >>> >> > >> >> - client_hdl = 0; > >> >> >>> >> > >> >> + free_client_hdl(&client_hdl); > >> >> >>> >> > >> >> } > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> TRACE_LEAVE(); > >> >> >>> >> > >> >> } > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> @@ -536,38 +535,35 @@ uint32_t > >> >> >>> lga_hdl_rec_del(lga_client_hdl_ > >> >> >>> >> > >> >> if (list_iter == rm_node) { > >> >> >>> >> > >> >> *list_head = rm_node->next; > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> - /** detach & release the IPC > >> >> >>> >> > >> >> - **/ > >> >> >>> >> > >> >> + /* detach & release the IPC */ > >> >> >>> >> > >> >> m_NCS_IPC_DETACH(&rm_node- > >> >> >mbx, > >> >> >>> >> > >> >> lga_clear_mbx, NULL); > >> >> >>> >> > >> >> m_NCS_IPC_RELEASE(&rm_node- > >> >> >mbx, NULL); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> ncshm_give_hdl(rm_node- > >> >> >local_hdl); > >> >> >>> >> > >> >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, > >> >> >>> >> > >> >> rm_node->local_hdl); > >> >> >>> >> > >> >> - /** Free the channel records off this hdl > >> >> >>> >> > >> >> - **/ > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> + /* Free the channel records off > >> >> this hdl */ > >> >> >>> >> > >> >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node- > >> >> >>> >> > >> >> >stream_list); > >> >> >>> >> > >> >> + free_client_hdl(&rm_node); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> - /** free the hdl rec > >> >> >>> >> > >> >> - **/ > >> >> >>> >> > >> >> - free(rm_node); > >> >> >>> >> > >> >> rc = NCSCC_RC_SUCCESS; > >> >> >>> >> > >> >> goto out; > >> >> >>> >> > >> >> - } else { /* find the rec */ > >> >> >>> >> > >> >> + } else { > >> >> >>> >> > >> >> + /* find the rec */ > >> >> >>> >> > >> >> while (NULL != list_iter) { > >> >> >>> >> > >> >> if (list_iter->next == > >> >> rm_node) { > >> >> >>> >> > >> >> > >> >> list_iter->next = > >> >> >>> >> > >> >> rm_node->next; > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> - /** detach & release the IPC */ > >> >> >>> >> > >> >> + /* > >> >> detach & release > >> >> >>> >> > >> >> the IPC */ > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, > >> >> lga_clear_mbx, > >> >> >>> >> > >> NULL); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> ncshm_give_hdl(rm_node->local_hdl); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, > >> >> rm_node- > >> >> >>> >> > >> >> >local_hdl); > >> >> >>> >> > >> >> - /** Free the channel records off > >> >> this lga_hdl */ > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> + /* > >> >> Free the channel > >> >> >>> >> > >> >> records off this lga_hdl */ > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node- > >> >> >stream_list); > >> >> >>> >> > >> >> - > >> >> >>> >> > >> >> - /** free the hdl rec */ > >> >> >>> >> > >> >> - > >> >> free(rm_node); > >> >> >>> >> > >> >> + > >> >> >>> >> > >> >> free_client_hdl(&rm_node); > >> >> >>> >> > >> >> > >> >> >>> >> > >> >> rc = > >> >> >>> >> > >> >> NCSCC_RC_SUCCESS; > >> >> >>> >> > >> >> goto > >> >> out; > >> >> >>> >> > >> > >> >> >>> >> > > >> >> >>> > >> >> > >> ------------------------------------------------------------------------------ Find and fix application performance issues faster with Applications Manager Applications Manager provides deep performance insights into multiple tiers of your business applications. It resolves application problems quickly and reduces your MTTR. Get your free trial! https://ad.doubleclick.net/ddm/clk/302982198;130105516;z _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel