Hi Hans,

The problem happened at the first place was without trace, I'm guessing because of LOG_ER, where write() get hit and the error string might not have been printed

static pid_t create_imcnprocess(SaAmfHAStateT ha_state) {
...
    i_pid = fork();
    if (i_pid == -1) {
*        LOG_ER("Failed to fork %s", strerror(errno));*
        abort();
    } else if (i_pid == 0) {
...
}

The main thread is getting SIG_TERM and its child thread is doing fork()
I have sent another patch to prevent this happen, I think Lennart is also checking the idea

If join_ret is not used, NULL should be passed, otherwise it should check against PTHREAD_CANCELED.

Thanks!
Minh

On 27/06/17 17:23, Hans Nordebäck wrote:
Hi Minh,

Is trace enabled when this problem occurs? If so trace adds cancellation 
points, e.g. at fork/exec in create_imcnprocess.
In stop_ntfimcn, pthread_join the join_ret variable is not used, use NULL 
instead, looking at the backtrace in the ticket, it's value is -1.

/BR Hans


-----Original Message-----
From: minh chau [mailto:[email protected]]
Sent: den 23 juni 2017 07:42
To: Lennart Lund <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: [devel] [PATCH 1/1] ntfd: Ensure mutex is not taken after 
cnsurvail_thread is canceled [#2508]

Hi Lennart,

Using ASYNCHRONOUS in this thread does not sound that bad, since the thread 
does not deal with IO, allocation/deallocation, ...
By the way, is the patch below what your suggestion if I understand it right?

Thanks,
Minh

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c index 
dd27a255c..7f817075f 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT ha_state)
       TRACE_LEAVE();
       return i_pid;
   }
+/*
+ * Cleanup handler for cnsurvail thread
+ * @param: arg[in]: a mutex
+ */
+void cnsurvail_cleanup_handler(void* arg) {
+  pthread_mutex_t* mutex = (pthread_mutex_t*)(arg);
+  TRACE_ENTER();
+  if (mutex) {
+    TRACE("osaf_mutex_unlock_ordie");
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+  }
+  TRACE_LEAVE();
+}

   /**
    * Thread:
@@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params)

       while (1) {
           osaf_mutex_lock_ordie(&ntfimcn_mutex);
+    pthread_cleanup_push(cnsurvail_cleanup_handler, &ntfimcn_mutex);
           pid = create_imcnprocess(ipar->ha_state);
           ipar->pid = pid;
-        osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+        pthread_cleanup_pop(1);

           /* Wait for child process to exit */
           do {
@@ -344,7 +359,9 @@ int stop_ntfimcn(void)

       if (ipar.ha_state != 0) {
           TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+    osaf_mutex_lock_ordie(&ntfimcn_mutex);
           rc = pthread_cancel(ipar.thread);
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
           if (rc != 0)
               osaf_abort(rc);
           rc = pthread_join(ipar.thread, &join_ret);

On 22/06/17 18:57, Lennart Lund wrote:
Hi Minh

This solution will probably work quite well but there is still a minor
risk that the mutex will be taken before the thread is cancelled and there is also a risk 
(see Note2) "Asynchronous cancelability means that the thread can be canceled at any 
time (usually immediately, but the system does not guarantee this)"

Instead:
In the thread a cancellation function that unlocks the mutex can be pushed just 
after the mutex is locked. This function takes the mutex as in-parameter and its 
only function is to unlock the mutex. Instead of unlocking the mutex by calling 
osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current code the cancellation 
function is popped and executed (it is unlocking the mutex) See 
pthread_cleanup_push() and pthread_cleanup_pop(). With this solution the 
cancellation type should not be changed to ASYNCHORNOUS it shall remain 
PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread will never be 
terminated with the mutex locked.
Note1: Also with this solution the mutex shall be taken in the main thread 
before rc = pthread_cancel(ipar.thread); and released just after.

Note2: I found that changing to cancellation type ASYNCHORNOUS is not safe 
because only a few are async-cancel-safe and for example waitpid() is not one 
of the safe fuctions!

Thanks
Lennart

-----Original Message-----
From: Minh Chau [mailto:[email protected]]
Sent: den 22 juni 2017 06:14
To: Lennart Lund <[email protected]>;
[email protected]
Cc: [email protected]; Minh Hon Chau
<[email protected]>
Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread is canceled [#2508]

In the scenario of shutting down SC while SC switchover is on going,
ntfd coredump is generated due to failure of pthread_mutex_destroy()
with errorcode:16(EBUSY). That means the mutex had been taken and was
not unlocked at the time phtread_mutex_destroy() is called.

One solution is adding mutex protection for pthread_cancel() so that
there's no cancellation request if cnsurvail_thread() is taking
mutex, or cnsurvail_thread() can not take mutex if the thread
cancellation request is issued. That also needs the cnsurvail_thread
to have the cancellation type as ASYNCHORNOUS. Otherwise the same
coredump issue still occurs since the cancellation request is
deffered (cancellation type as PTHREAD_CANCEL_DEFERRED set by
default)
---
   src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++
   1 file changed, 6 insertions(+)

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c
b/src/ntf/ntfd/ntfs_imcnutil.c index dd27a255c..672a3910e 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -238,6 +238,10 @@ static void *cnsurvail_thread(void
*_init_params)

        TRACE_ENTER();

+       pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status);
+       pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,
&status);
+       TRACE("old cancellation type: %d", status);
+
        while (1) {
                osaf_mutex_lock_ordie(&ntfimcn_mutex);
                pid = create_imcnprocess(ipar->ha_state);
@@ -344,7 +348,9 @@ int stop_ntfimcn(void)

        if (ipar.ha_state != 0) {
                TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+               osaf_mutex_lock_ordie(&ntfimcn_mutex);
                rc = pthread_cancel(ipar.thread);
+               osaf_mutex_unlock_ordie(&ntfimcn_mutex);
                if (rc != 0)
                        osaf_abort(rc);
                rc = pthread_join(ipar.thread, &join_ret);
--
2.11.0

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech 
sites, Slashdot.org! http://sdm.link/slashdot 
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to