Hi Minh See my comment inline [Lennart]
Thanks Lennart > -----Original Message----- > From: minh chau [mailto:[email protected]] > Sent: den 27 juni 2017 07:28 > To: Lennart Lund <[email protected]>; > [email protected] > Cc: [email protected]; Anders Widell > <[email protected]> > Subject: Re: [PATCH 1/1] ntfd: Ensure mutex is not taken after > cnsurvail_thread is canceled [#2508] > > Hi Lennart, > > Please see my comments inline [Minh] > > Thanks, > Minh > > On 27/06/17 00:16, Lennart Lund wrote: > > Hi Minh > > > > Yes, this is what I suggested but see my comments inline [Lennart] > > > > It is not a good idea to use ASYNCHRONOUS since some functions used in > the thread is not allowed if this is the cancellation type. See "Async-cancel- > safe functions" in pthreads section in Linux man pages. > [Minh] So we need to re-ensure ASYNCHRONOUS is also not used in > somewhere else in our code. > > However, PTHREAD_CANCEL_DEFERRED do create a potential problem. If > the cancellation request is given when the thread is no longer waiting in > waitpid() but before it has taken the mutex the thread will then reach > create_imcnprocess() that will create the imcnprocess. To be able to kill the > process the Pid is needed and for that purpose the Pid saved in the ipar > structure is used. However the pid is not saved until after the > create_imcnprocess() returns. If for some reason the thread is cancelled > after the process is created but before the Pid is saved the process will not > be possible to kill (we have no Pid). The current code does not have a > cancellation point between the fork() and saving of the pid so this should not > happen but you never know what will happen to the code in the future so I > suggest that pthread_testcancel() is called just before fork() is called. > [Minh]: I think this ticket is originated from 2 problems: > 1. Stopping surveillance should be in reverse order of starting it. That > means, start sequence currently is: ntfd mainthread -> > cnsurvail_thread() -> ntfimcnd. However, stop sequence is: > cnsurvail_thread -> ntfimcnd -> ntfd_mainthread. So that should be the > reason of the above problem you mentioned, where we don't Pid to kill > ntfimcnd > 2. In stopping sequence from top down view of surveillance, the main > thread is trying to stop cnsurvail_thread, while the cnsurvail_thread is > going to start ntfimcnd. This is causing trouble and there should be > someway to get cnsurvail_thread being aware that it does not start > ntfimcnd in stopping sequence. > > Attaching a simple patch that solves 2 above problems, so we don't have > to add too much threading stuffs for this ticket [Lennart] The solution in your patch seems to do the job. However it is cryptic and hard to understand. This could be fixed by adding simple comments in both the thread and stop functions. Re-using the ha_state flag here for a completely different purpose than it is intended for and named to inform about is most confusing and misleading! > > > As I understand the original problem is that the ntfd process is aborted > because it is trying to destroy the ntfimcn_mutex while it is locked by the > thread but why is this happening? > > As far as I can see there is no cancellation point between where the mutex > is locked and unlocked and the cancellation type is > PTHREAD_CANCEL_DEFERRED. The only cancellation point in the thread is the > waitpid(). > [Minh]: This problem happened I think because one of other cancellation > points inside log/trace get called, where create_imcnprocess() does > LOG_ER/TRACE function. > > > > Thanks > > Lennart > > > >> -----Original Message----- > >> From: minh chau [mailto:[email protected]] > >> Sent: den 23 juni 2017 07:42 > >> To: Lennart Lund<[email protected]>; > >> [email protected] > >> Cc:[email protected]; Anders Widell > >> <[email protected]> > >> Subject: Re: [PATCH 1/1] ntfd: Ensure mutex is not taken after > >> cnsurvail_thread is canceled [#2508] > >> > >> Hi Lennart, > >> > >> Using ASYNCHRONOUS in this thread does not sound that bad, since the > >> thread does not deal with IO, allocation/deallocation, ... > >> By the way, is the patch below what your suggestion if I understand it > >> right? > >> > >> Thanks, > >> Minh > >> > >> diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c > >> index dd27a255c..7f817075f 100644 > >> --- a/src/ntf/ntfd/ntfs_imcnutil.c > >> +++ b/src/ntf/ntfd/ntfs_imcnutil.c > >> @@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT > >> ha_state) > >> TRACE_LEAVE(); > >> return i_pid; > >> } > >> +/* > >> + * Cleanup handler for cnsurvail thread > >> + * @param: arg[in]: a mutex > >> + */ > >> +void cnsurvail_cleanup_handler(void* arg) > >> +{ > >> + pthread_mutex_t* mutex = (pthread_mutex_t*)(arg); > >> + TRACE_ENTER(); > >> + if (mutex) { > >> + TRACE("osaf_mutex_unlock_ordie"); > >> + osaf_mutex_unlock_ordie(&ntfimcn_mutex); > > [Lennart] This seems incorrect. Should be: mutex_unlock_ordie(mutex); > >> + } > >> + TRACE_LEAVE(); > >> +} > >> > >> /** > >> * Thread: > >> @@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params) > >> > >> while (1) { > >> osaf_mutex_lock_ordie(&ntfimcn_mutex); > >> + pthread_cleanup_push(cnsurvail_cleanup_handler, > &ntfimcn_mutex); > >> pid = create_imcnprocess(ipar->ha_state); > >> ipar->pid = pid; > >> - osaf_mutex_unlock_ordie(&ntfimcn_mutex); > >> + pthread_cleanup_pop(1); > >> > >> /* Wait for child process to exit */ > >> do { > >> @@ -344,7 +359,9 @@ int stop_ntfimcn(void) > >> > >> if (ipar.ha_state != 0) { > >> TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__); > >> + osaf_mutex_lock_ordie(&ntfimcn_mutex); > >> rc = pthread_cancel(ipar.thread); > >> + osaf_mutex_unlock_ordie(&ntfimcn_mutex); > >> if (rc != 0) > >> osaf_abort(rc); > >> rc = pthread_join(ipar.thread, &join_ret); > >> > >> On 22/06/17 18:57, Lennart Lund wrote: > >>> Hi Minh > >>> > >>> This solution will probably work quite well but there is still a minor > >>> risk > that > >> the mutex will be taken before the thread is cancelled and there is also a > risk > >> (see Note2) > >>> "Asynchronous cancelability means that the thread can be canceled at > any > >> time (usually immediately, but the system does not guarantee this)" > >>> Instead: > >>> In the thread a cancellation function that unlocks the mutex can be > pushed > >> just after the mutex is locked. This function takes the mutex as in- > parameter > >> and its only function is to unlock the mutex. Instead of unlocking the > mutex > >> by calling osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current > code > >> the cancellation function is popped and executed (it is unlocking the > mutex) > >> See pthread_cleanup_push() and pthread_cleanup_pop(). With this > solution > >> the cancellation type should not be changed to ASYNCHORNOUS it shall > >> remain PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread > will > >> never be terminated with the mutex locked. > >>> Note1: Also with this solution the mutex shall be taken in the main > thread > >> before rc = pthread_cancel(ipar.thread); and released just after. > >>> Note2: I found that changing to cancellation type ASYNCHORNOUS is not > >> safe because only a few are async-cancel-safe and for example waitpid() > is > >> not one of the safe fuctions! > >>> Thanks > >>> Lennart > >>> > >>>> -----Original Message----- > >>>> From: Minh Chau [mailto:[email protected]] > >>>> Sent: den 22 juni 2017 06:14 > >>>> To: Lennart Lund<[email protected]>; > >>>> [email protected] > >>>> Cc:[email protected]; Minh Hon Chau > >>>> <[email protected]> > >>>> Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after > >> cnsurvail_thread > >>>> is canceled [#2508] > >>>> > >>>> In the scenario of shutting down SC while SC switchover is on going, > >>>> ntfd coredump is generated due to failure of > pthread_mutex_destroy() > >>>> with errorcode:16(EBUSY). That means the mutex had been taken and > >>>> was not unlocked at the time phtread_mutex_destroy() is called. > >>>> > >>>> One solution is adding mutex protection for pthread_cancel() so that > >>>> there's no cancellation request if cnsurvail_thread() is taking mutex, > >>>> or cnsurvail_thread() can not take mutex if the thread cancellation > >>>> request is issued. That also needs the cnsurvail_thread to have the > >>>> cancellation type as ASYNCHORNOUS. Otherwise the same coredump > >> issue > >>>> still occurs since the cancellation request is deffered (cancellation > >>>> type as PTHREAD_CANCEL_DEFERRED set by default) > >>>> --- > >>>> src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++ > >>>> 1 file changed, 6 insertions(+) > >>>> > >>>> diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c > >>>> index dd27a255c..672a3910e 100644 > >>>> --- a/src/ntf/ntfd/ntfs_imcnutil.c > >>>> +++ b/src/ntf/ntfd/ntfs_imcnutil.c > >>>> @@ -238,6 +238,10 @@ static void *cnsurvail_thread(void > *_init_params) > >>>> > >>>> TRACE_ENTER(); > >>>> > >>>> + pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status); > >>>> + pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, > >>>> &status); > >>>> + TRACE("old cancellation type: %d", status); > >>>> + > >>>> while (1) { > >>>> osaf_mutex_lock_ordie(&ntfimcn_mutex); > >>>> pid = create_imcnprocess(ipar->ha_state); > >>>> @@ -344,7 +348,9 @@ int stop_ntfimcn(void) > >>>> > >>>> if (ipar.ha_state != 0) { > >>>> TRACE("%s: Cancel the imcn surveillance thread", > >>>> __FUNCTION__); > >>>> + osaf_mutex_lock_ordie(&ntfimcn_mutex); > >>>> rc = pthread_cancel(ipar.thread); > >>>> + osaf_mutex_unlock_ordie(&ntfimcn_mutex); > >>>> if (rc != 0) > >>>> osaf_abort(rc); > >>>> rc = pthread_join(ipar.thread, &join_ret); > >>>> -- > >>>> 2.11.0 ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
