Hi Minh,

Is trace enabled when this problem occurs? If so trace adds cancellation 
points, e.g. at fork/exec in create_imcnprocess. 
In stop_ntfimcn, pthread_join the join_ret variable is not used, use NULL 
instead, looking at the backtrace in the ticket, it's value is -1.

/BR Hans


-----Original Message-----
From: minh chau [mailto:[email protected]] 
Sent: den 23 juni 2017 07:42
To: Lennart Lund <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: [devel] [PATCH 1/1] ntfd: Ensure mutex is not taken after 
cnsurvail_thread is canceled [#2508]

Hi Lennart,

Using ASYNCHRONOUS in this thread does not sound that bad, since the thread 
does not deal with IO, allocation/deallocation, ...
By the way, is the patch below what your suggestion if I understand it right?

Thanks,
Minh

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c index 
dd27a255c..7f817075f 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT ha_state)
      TRACE_LEAVE();
      return i_pid;
  }
+/*
+ * Cleanup handler for cnsurvail thread
+ * @param: arg[in]: a mutex
+ */
+void cnsurvail_cleanup_handler(void* arg) {
+  pthread_mutex_t* mutex = (pthread_mutex_t*)(arg);
+  TRACE_ENTER();
+  if (mutex) {
+    TRACE("osaf_mutex_unlock_ordie");
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+  }
+  TRACE_LEAVE();
+}

  /**
   * Thread:
@@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params)

      while (1) {
          osaf_mutex_lock_ordie(&ntfimcn_mutex);
+    pthread_cleanup_push(cnsurvail_cleanup_handler, &ntfimcn_mutex);
          pid = create_imcnprocess(ipar->ha_state);
          ipar->pid = pid;
-        osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+        pthread_cleanup_pop(1);

          /* Wait for child process to exit */
          do {
@@ -344,7 +359,9 @@ int stop_ntfimcn(void)

      if (ipar.ha_state != 0) {
          TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+    osaf_mutex_lock_ordie(&ntfimcn_mutex);
          rc = pthread_cancel(ipar.thread);
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
          if (rc != 0)
              osaf_abort(rc);
          rc = pthread_join(ipar.thread, &join_ret);

On 22/06/17 18:57, Lennart Lund wrote:
> Hi Minh
>
> This solution will probably work quite well but there is still a minor 
> risk that the mutex will be taken before the thread is cancelled and there is 
> also a risk (see Note2) "Asynchronous cancelability means that the thread can 
> be canceled at any time (usually immediately, but the system does not 
> guarantee this)"
>
> Instead:
> In the thread a cancellation function that unlocks the mutex can be pushed 
> just after the mutex is locked. This function takes the mutex as in-parameter 
> and its only function is to unlock the mutex. Instead of unlocking the mutex 
> by calling osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current code the 
> cancellation function is popped and executed (it is unlocking the mutex) See 
> pthread_cleanup_push() and pthread_cleanup_pop(). With this solution the 
> cancellation type should not be changed to ASYNCHORNOUS it shall remain 
> PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread will never be 
> terminated with the mutex locked.
> Note1: Also with this solution the mutex shall be taken in the main thread 
> before rc = pthread_cancel(ipar.thread); and released just after.
>
> Note2: I found that changing to cancellation type ASYNCHORNOUS is not safe 
> because only a few are async-cancel-safe and for example waitpid() is not one 
> of the safe fuctions!
>
> Thanks
> Lennart
>
>> -----Original Message-----
>> From: Minh Chau [mailto:[email protected]]
>> Sent: den 22 juni 2017 06:14
>> To: Lennart Lund <[email protected]>; 
>> [email protected]
>> Cc: [email protected]; Minh Hon Chau 
>> <[email protected]>
>> Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after 
>> cnsurvail_thread is canceled [#2508]
>>
>> In the scenario of shutting down SC while SC switchover is on going, 
>> ntfd coredump is generated due to failure of pthread_mutex_destroy() 
>> with errorcode:16(EBUSY). That means the mutex had been taken and was 
>> not unlocked at the time phtread_mutex_destroy() is called.
>>
>> One solution is adding mutex protection for pthread_cancel() so that 
>> there's no cancellation request if cnsurvail_thread() is taking 
>> mutex, or cnsurvail_thread() can not take mutex if the thread 
>> cancellation request is issued. That also needs the cnsurvail_thread 
>> to have the cancellation type as ASYNCHORNOUS. Otherwise the same 
>> coredump issue still occurs since the cancellation request is 
>> deffered (cancellation type as PTHREAD_CANCEL_DEFERRED set by 
>> default)
>> ---
>>   src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/src/ntf/ntfd/ntfs_imcnutil.c 
>> b/src/ntf/ntfd/ntfs_imcnutil.c index dd27a255c..672a3910e 100644
>> --- a/src/ntf/ntfd/ntfs_imcnutil.c
>> +++ b/src/ntf/ntfd/ntfs_imcnutil.c
>> @@ -238,6 +238,10 @@ static void *cnsurvail_thread(void 
>> *_init_params)
>>
>>      TRACE_ENTER();
>>
>> +    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status);
>> +    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,
>> &status);
>> +    TRACE("old cancellation type: %d", status);
>> +
>>      while (1) {
>>              osaf_mutex_lock_ordie(&ntfimcn_mutex);
>>              pid = create_imcnprocess(ipar->ha_state);
>> @@ -344,7 +348,9 @@ int stop_ntfimcn(void)
>>
>>      if (ipar.ha_state != 0) {
>>              TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
>> +            osaf_mutex_lock_ordie(&ntfimcn_mutex);
>>              rc = pthread_cancel(ipar.thread);
>> +            osaf_mutex_unlock_ordie(&ntfimcn_mutex);
>>              if (rc != 0)
>>                      osaf_abort(rc);
>>              rc = pthread_join(ipar.thread, &join_ret);
>> --
>> 2.11.0
>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech 
sites, Slashdot.org! http://sdm.link/slashdot 
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to