Hi Minh Yes, this is what I suggested but see my comments inline [Lennart]
It is not a good idea to use ASYNCHRONOUS since some functions used in the thread is not allowed if this is the cancellation type. See "Async-cancel-safe functions" in pthreads section in Linux man pages. However, PTHREAD_CANCEL_DEFERRED do create a potential problem. If the cancellation request is given when the thread is no longer waiting in waitpid() but before it has taken the mutex the thread will then reach create_imcnprocess() that will create the imcnprocess. To be able to kill the process the Pid is needed and for that purpose the Pid saved in the ipar structure is used. However the pid is not saved until after the create_imcnprocess() returns. If for some reason the thread is cancelled after the process is created but before the Pid is saved the process will not be possible to kill (we have no Pid). The current code does not have a cancellation point between the fork() and saving of the pid so this should not happen but you never know what will happen to the code in the future so I suggest that pthread_testcancel() is called just before fork() is called. As I understand the original problem is that the ntfd process is aborted because it is trying to destroy the ntfimcn_mutex while it is locked by the thread but why is this happening? As far as I can see there is no cancellation point between where the mutex is locked and unlocked and the cancellation type is PTHREAD_CANCEL_DEFERRED. The only cancellation point in the thread is the waitpid(). Thanks Lennart > -----Original Message----- > From: minh chau [mailto:[email protected]] > Sent: den 23 juni 2017 07:42 > To: Lennart Lund <[email protected]>; > [email protected] > Cc: [email protected]; Anders Widell > <[email protected]> > Subject: Re: [PATCH 1/1] ntfd: Ensure mutex is not taken after > cnsurvail_thread is canceled [#2508] > > Hi Lennart, > > Using ASYNCHRONOUS in this thread does not sound that bad, since the > thread does not deal with IO, allocation/deallocation, ... > By the way, is the patch below what your suggestion if I understand it > right? > > Thanks, > Minh > > diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c > index dd27a255c..7f817075f 100644 > --- a/src/ntf/ntfd/ntfs_imcnutil.c > +++ b/src/ntf/ntfd/ntfs_imcnutil.c > @@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT > ha_state) > TRACE_LEAVE(); > return i_pid; > } > +/* > + * Cleanup handler for cnsurvail thread > + * @param: arg[in]: a mutex > + */ > +void cnsurvail_cleanup_handler(void* arg) > +{ > + pthread_mutex_t* mutex = (pthread_mutex_t*)(arg); > + TRACE_ENTER(); > + if (mutex) { > + TRACE("osaf_mutex_unlock_ordie"); > + osaf_mutex_unlock_ordie(&ntfimcn_mutex); [Lennart] This seems incorrect. Should be: mutex_unlock_ordie(mutex); > + } > + TRACE_LEAVE(); > +} > > /** > * Thread: > @@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params) > > while (1) { > osaf_mutex_lock_ordie(&ntfimcn_mutex); > + pthread_cleanup_push(cnsurvail_cleanup_handler, &ntfimcn_mutex); > pid = create_imcnprocess(ipar->ha_state); > ipar->pid = pid; > - osaf_mutex_unlock_ordie(&ntfimcn_mutex); > + pthread_cleanup_pop(1); > > /* Wait for child process to exit */ > do { > @@ -344,7 +359,9 @@ int stop_ntfimcn(void) > > if (ipar.ha_state != 0) { > TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__); > + osaf_mutex_lock_ordie(&ntfimcn_mutex); > rc = pthread_cancel(ipar.thread); > + osaf_mutex_unlock_ordie(&ntfimcn_mutex); > if (rc != 0) > osaf_abort(rc); > rc = pthread_join(ipar.thread, &join_ret); > > On 22/06/17 18:57, Lennart Lund wrote: > > Hi Minh > > > > This solution will probably work quite well but there is still a minor risk > > that > the mutex will be taken before the thread is cancelled and there is also a > risk > (see Note2) > > "Asynchronous cancelability means that the thread can be canceled at any > time (usually immediately, but the system does not guarantee this)" > > > > Instead: > > In the thread a cancellation function that unlocks the mutex can be pushed > just after the mutex is locked. This function takes the mutex as in-parameter > and its only function is to unlock the mutex. Instead of unlocking the mutex > by calling osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current code > the cancellation function is popped and executed (it is unlocking the mutex) > See pthread_cleanup_push() and pthread_cleanup_pop(). With this solution > the cancellation type should not be changed to ASYNCHORNOUS it shall > remain PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread will > never be terminated with the mutex locked. > > Note1: Also with this solution the mutex shall be taken in the main thread > before rc = pthread_cancel(ipar.thread); and released just after. > > > > Note2: I found that changing to cancellation type ASYNCHORNOUS is not > safe because only a few are async-cancel-safe and for example waitpid() is > not one of the safe fuctions! > > > > Thanks > > Lennart > > > >> -----Original Message----- > >> From: Minh Chau [mailto:[email protected]] > >> Sent: den 22 juni 2017 06:14 > >> To: Lennart Lund <[email protected]>; > >> [email protected] > >> Cc: [email protected]; Minh Hon Chau > >> <[email protected]> > >> Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after > cnsurvail_thread > >> is canceled [#2508] > >> > >> In the scenario of shutting down SC while SC switchover is on going, > >> ntfd coredump is generated due to failure of pthread_mutex_destroy() > >> with errorcode:16(EBUSY). That means the mutex had been taken and > >> was not unlocked at the time phtread_mutex_destroy() is called. > >> > >> One solution is adding mutex protection for pthread_cancel() so that > >> there's no cancellation request if cnsurvail_thread() is taking mutex, > >> or cnsurvail_thread() can not take mutex if the thread cancellation > >> request is issued. That also needs the cnsurvail_thread to have the > >> cancellation type as ASYNCHORNOUS. Otherwise the same coredump > issue > >> still occurs since the cancellation request is deffered (cancellation > >> type as PTHREAD_CANCEL_DEFERRED set by default) > >> --- > >> src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++ > >> 1 file changed, 6 insertions(+) > >> > >> diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c > >> index dd27a255c..672a3910e 100644 > >> --- a/src/ntf/ntfd/ntfs_imcnutil.c > >> +++ b/src/ntf/ntfd/ntfs_imcnutil.c > >> @@ -238,6 +238,10 @@ static void *cnsurvail_thread(void *_init_params) > >> > >> TRACE_ENTER(); > >> > >> + pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status); > >> + pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, > >> &status); > >> + TRACE("old cancellation type: %d", status); > >> + > >> while (1) { > >> osaf_mutex_lock_ordie(&ntfimcn_mutex); > >> pid = create_imcnprocess(ipar->ha_state); > >> @@ -344,7 +348,9 @@ int stop_ntfimcn(void) > >> > >> if (ipar.ha_state != 0) { > >> TRACE("%s: Cancel the imcn surveillance thread", > >> __FUNCTION__); > >> + osaf_mutex_lock_ordie(&ntfimcn_mutex); > >> rc = pthread_cancel(ipar.thread); > >> + osaf_mutex_unlock_ordie(&ntfimcn_mutex); > >> if (rc != 0) > >> osaf_abort(rc); > >> rc = pthread_join(ipar.thread, &join_ret); > >> -- > >> 2.11.0 > > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
