Hi Minh, Is trace enabled when this problem occurs? If so trace adds cancellation points, e.g. at fork/exec in create_imcnprocess. In stop_ntfimcn, pthread_join the join_ret variable is not used, use NULL instead, looking at the backtrace in the ticket, it's value is -1.
/BR Hans -----Original Message----- From: minh chau [mailto:[email protected]] Sent: den 23 juni 2017 07:42 To: Lennart Lund <[email protected]>; [email protected] Cc: [email protected] Subject: Re: [devel] [PATCH 1/1] ntfd: Ensure mutex is not taken after cnsurvail_thread is canceled [#2508] Hi Lennart, Using ASYNCHRONOUS in this thread does not sound that bad, since the thread does not deal with IO, allocation/deallocation, ... By the way, is the patch below what your suggestion if I understand it right? Thanks, Minh diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c index dd27a255c..7f817075f 100644 --- a/src/ntf/ntfd/ntfs_imcnutil.c +++ b/src/ntf/ntfd/ntfs_imcnutil.c @@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT ha_state) TRACE_LEAVE(); return i_pid; } +/* + * Cleanup handler for cnsurvail thread + * @param: arg[in]: a mutex + */ +void cnsurvail_cleanup_handler(void* arg) { + pthread_mutex_t* mutex = (pthread_mutex_t*)(arg); + TRACE_ENTER(); + if (mutex) { + TRACE("osaf_mutex_unlock_ordie"); + osaf_mutex_unlock_ordie(&ntfimcn_mutex); + } + TRACE_LEAVE(); +} /** * Thread: @@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params) while (1) { osaf_mutex_lock_ordie(&ntfimcn_mutex); + pthread_cleanup_push(cnsurvail_cleanup_handler, &ntfimcn_mutex); pid = create_imcnprocess(ipar->ha_state); ipar->pid = pid; - osaf_mutex_unlock_ordie(&ntfimcn_mutex); + pthread_cleanup_pop(1); /* Wait for child process to exit */ do { @@ -344,7 +359,9 @@ int stop_ntfimcn(void) if (ipar.ha_state != 0) { TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__); + osaf_mutex_lock_ordie(&ntfimcn_mutex); rc = pthread_cancel(ipar.thread); + osaf_mutex_unlock_ordie(&ntfimcn_mutex); if (rc != 0) osaf_abort(rc); rc = pthread_join(ipar.thread, &join_ret); On 22/06/17 18:57, Lennart Lund wrote: > Hi Minh > > This solution will probably work quite well but there is still a minor > risk that the mutex will be taken before the thread is cancelled and there is > also a risk (see Note2) "Asynchronous cancelability means that the thread can > be canceled at any time (usually immediately, but the system does not > guarantee this)" > > Instead: > In the thread a cancellation function that unlocks the mutex can be pushed > just after the mutex is locked. This function takes the mutex as in-parameter > and its only function is to unlock the mutex. Instead of unlocking the mutex > by calling osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current code the > cancellation function is popped and executed (it is unlocking the mutex) See > pthread_cleanup_push() and pthread_cleanup_pop(). With this solution the > cancellation type should not be changed to ASYNCHORNOUS it shall remain > PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread will never be > terminated with the mutex locked. > Note1: Also with this solution the mutex shall be taken in the main thread > before rc = pthread_cancel(ipar.thread); and released just after. > > Note2: I found that changing to cancellation type ASYNCHORNOUS is not safe > because only a few are async-cancel-safe and for example waitpid() is not one > of the safe fuctions! > > Thanks > Lennart > >> -----Original Message----- >> From: Minh Chau [mailto:[email protected]] >> Sent: den 22 juni 2017 06:14 >> To: Lennart Lund <[email protected]>; >> [email protected] >> Cc: [email protected]; Minh Hon Chau >> <[email protected]> >> Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after >> cnsurvail_thread is canceled [#2508] >> >> In the scenario of shutting down SC while SC switchover is on going, >> ntfd coredump is generated due to failure of pthread_mutex_destroy() >> with errorcode:16(EBUSY). That means the mutex had been taken and was >> not unlocked at the time phtread_mutex_destroy() is called. >> >> One solution is adding mutex protection for pthread_cancel() so that >> there's no cancellation request if cnsurvail_thread() is taking >> mutex, or cnsurvail_thread() can not take mutex if the thread >> cancellation request is issued. That also needs the cnsurvail_thread >> to have the cancellation type as ASYNCHORNOUS. Otherwise the same >> coredump issue still occurs since the cancellation request is >> deffered (cancellation type as PTHREAD_CANCEL_DEFERRED set by >> default) >> --- >> src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/src/ntf/ntfd/ntfs_imcnutil.c >> b/src/ntf/ntfd/ntfs_imcnutil.c index dd27a255c..672a3910e 100644 >> --- a/src/ntf/ntfd/ntfs_imcnutil.c >> +++ b/src/ntf/ntfd/ntfs_imcnutil.c >> @@ -238,6 +238,10 @@ static void *cnsurvail_thread(void >> *_init_params) >> >> TRACE_ENTER(); >> >> + pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status); >> + pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, >> &status); >> + TRACE("old cancellation type: %d", status); >> + >> while (1) { >> osaf_mutex_lock_ordie(&ntfimcn_mutex); >> pid = create_imcnprocess(ipar->ha_state); >> @@ -344,7 +348,9 @@ int stop_ntfimcn(void) >> >> if (ipar.ha_state != 0) { >> TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__); >> + osaf_mutex_lock_ordie(&ntfimcn_mutex); >> rc = pthread_cancel(ipar.thread); >> + osaf_mutex_unlock_ordie(&ntfimcn_mutex); >> if (rc != 0) >> osaf_abort(rc); >> rc = pthread_join(ipar.thread, &join_ret); >> -- >> 2.11.0 > ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
