Hi Lennart,

Please see my comments inline [Minh]

Thanks,
Minh

On 27/06/17 00:16, Lennart Lund wrote:
Hi Minh

Yes, this is what I suggested but see my comments inline [Lennart]

It is not a good idea to use ASYNCHRONOUS since some functions used in the thread is not 
allowed if this is the cancellation type. See "Async-cancel-safe functions" in 
pthreads section in Linux man pages.
[Minh] So we need to re-ensure ASYNCHRONOUS is also not used in somewhere else in our code.
However, PTHREAD_CANCEL_DEFERRED do create a potential problem. If the 
cancellation request is given when the thread is no longer waiting in waitpid() 
but before it has taken the mutex the thread will then reach 
create_imcnprocess() that will create the imcnprocess. To be able to kill the 
process the Pid is needed and for that purpose the Pid saved in the ipar 
structure is used. However the pid is not saved until after the 
create_imcnprocess() returns. If for some reason the thread is cancelled after 
the process is created but before the Pid is saved the process will not be 
possible to kill (we have no Pid). The current code does not have a 
cancellation point between the fork() and saving of the pid so this should not 
happen but you never know what will happen to the code in the future so I 
suggest that pthread_testcancel() is called just before fork() is called.
[Minh]: I think this ticket is originated from 2 problems:
1. Stopping surveillance should be in reverse order of starting it. That means, start sequence currently is: ntfd mainthread -> cnsurvail_thread() -> ntfimcnd. However, stop sequence is: cnsurvail_thread -> ntfimcnd -> ntfd_mainthread. So that should be the reason of the above problem you mentioned, where we don't Pid to kill ntfimcnd 2. In stopping sequence from top down view of surveillance, the main thread is trying to stop cnsurvail_thread, while the cnsurvail_thread is going to start ntfimcnd. This is causing trouble and there should be someway to get cnsurvail_thread being aware that it does not start ntfimcnd in stopping sequence.

Attaching a simple patch that solves 2 above problems, so we don't have to add too much threading stuffs for this ticket

As I understand the original problem is that the ntfd process is aborted 
because it is trying to destroy the ntfimcn_mutex while it is locked by the 
thread but why is this happening?
As far as I can see there is no cancellation point between where the mutex is 
locked and unlocked and the cancellation type is PTHREAD_CANCEL_DEFERRED. The 
only cancellation point in the thread is the waitpid().
[Minh]: This problem happened I think because one of other cancellation points inside log/trace get called, where create_imcnprocess() does LOG_ER/TRACE function.

Thanks
Lennart

-----Original Message-----
From: minh chau [mailto:[email protected]]
Sent: den 23 juni 2017 07:42
To: Lennart Lund<[email protected]>;
[email protected]
Cc:[email protected]; Anders Widell
<[email protected]>
Subject: Re: [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread is canceled [#2508]

Hi Lennart,

Using ASYNCHRONOUS in this thread does not sound that bad, since the
thread does not deal with IO, allocation/deallocation, ...
By the way, is the patch below what your suggestion if I understand it
right?

Thanks,
Minh

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c
index dd27a255c..7f817075f 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT
ha_state)
       TRACE_LEAVE();
       return i_pid;
   }
+/*
+ * Cleanup handler for cnsurvail thread
+ * @param: arg[in]: a mutex
+ */
+void cnsurvail_cleanup_handler(void* arg)
+{
+  pthread_mutex_t* mutex = (pthread_mutex_t*)(arg);
+  TRACE_ENTER();
+  if (mutex) {
+    TRACE("osaf_mutex_unlock_ordie");
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
[Lennart] This seems incorrect. Should be: mutex_unlock_ordie(mutex);
+  }
+  TRACE_LEAVE();
+}

   /**
    * Thread:
@@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params)

       while (1) {
           osaf_mutex_lock_ordie(&ntfimcn_mutex);
+    pthread_cleanup_push(cnsurvail_cleanup_handler, &ntfimcn_mutex);
           pid = create_imcnprocess(ipar->ha_state);
           ipar->pid = pid;
-        osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+        pthread_cleanup_pop(1);

           /* Wait for child process to exit */
           do {
@@ -344,7 +359,9 @@ int stop_ntfimcn(void)

       if (ipar.ha_state != 0) {
           TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+    osaf_mutex_lock_ordie(&ntfimcn_mutex);
           rc = pthread_cancel(ipar.thread);
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
           if (rc != 0)
               osaf_abort(rc);
           rc = pthread_join(ipar.thread, &join_ret);

On 22/06/17 18:57, Lennart Lund wrote:
Hi Minh

This solution will probably work quite well but there is still a minor risk that
the mutex will be taken before the thread is cancelled and there is also a risk
(see Note2)
"Asynchronous cancelability means that the thread can be canceled at any
time (usually immediately, but the system does not guarantee this)"
Instead:
In the thread a cancellation function that unlocks the mutex can be pushed
just after the mutex is locked. This function takes the mutex as in-parameter
and its only function is to unlock the mutex. Instead of unlocking the mutex
by calling osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current code
the cancellation function is popped and executed (it is unlocking the mutex)
See pthread_cleanup_push() and pthread_cleanup_pop(). With this solution
the cancellation type should not be changed to ASYNCHORNOUS it shall
remain PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread will
never be terminated with the mutex locked.
Note1: Also with this solution the mutex shall be taken in the main thread
before rc = pthread_cancel(ipar.thread); and released just after.
Note2: I found that changing to cancellation type ASYNCHORNOUS is not
safe because only a few are async-cancel-safe and for example waitpid() is
not one of the safe fuctions!
Thanks
Lennart

-----Original Message-----
From: Minh Chau [mailto:[email protected]]
Sent: den 22 juni 2017 06:14
To: Lennart Lund<[email protected]>;
[email protected]
Cc:[email protected]; Minh Hon Chau
<[email protected]>
Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread
is canceled [#2508]

In the scenario of shutting down SC while SC switchover is on going,
ntfd coredump is generated due to failure of pthread_mutex_destroy()
with errorcode:16(EBUSY). That means the mutex had been taken and
was not unlocked at the time phtread_mutex_destroy() is called.

One solution is adding mutex protection for pthread_cancel() so that
there's no cancellation request if cnsurvail_thread() is taking mutex,
or cnsurvail_thread() can not take mutex if the thread cancellation
request is issued. That also needs the cnsurvail_thread to have the
cancellation type as ASYNCHORNOUS. Otherwise the same coredump
issue
still occurs since the cancellation request is deffered (cancellation
type as PTHREAD_CANCEL_DEFERRED set by default)
---
   src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++
   1 file changed, 6 insertions(+)

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c
index dd27a255c..672a3910e 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -238,6 +238,10 @@ static void *cnsurvail_thread(void *_init_params)

        TRACE_ENTER();

+       pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status);
+       pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,
&status);
+       TRACE("old cancellation type: %d", status);
+
        while (1) {
                osaf_mutex_lock_ordie(&ntfimcn_mutex);
                pid = create_imcnprocess(ipar->ha_state);
@@ -344,7 +348,9 @@ int stop_ntfimcn(void)

        if (ipar.ha_state != 0) {
                TRACE("%s: Cancel the imcn surveillance thread",
__FUNCTION__);
+               osaf_mutex_lock_ordie(&ntfimcn_mutex);
                rc = pthread_cancel(ipar.thread);
+               osaf_mutex_unlock_ordie(&ntfimcn_mutex);
                if (rc != 0)
                        osaf_abort(rc);
                rc = pthread_join(ipar.thread, &join_ret);
--
2.11.0

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c
index dd27a255c..0f330e6b1 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -240,8 +240,10 @@ static void *cnsurvail_thread(void *_init_params)
 
 	while (1) {
 		osaf_mutex_lock_ordie(&ntfimcn_mutex);
-		pid = create_imcnprocess(ipar->ha_state);
-		ipar->pid = pid;
+		if (ipar->ha_state != 0) {
+			pid = create_imcnprocess(ipar->ha_state);
+			ipar->pid = pid;
+		}
 		osaf_mutex_unlock_ordie(&ntfimcn_mutex);
 
 		/* Wait for child process to exit */
@@ -342,21 +344,24 @@ int stop_ntfimcn(void)
 	int rc = 0;
 	TRACE_ENTER();
 
+	osaf_mutex_lock_ordie(&ntfimcn_mutex);
 	if (ipar.ha_state != 0) {
-		TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
-		rc = pthread_cancel(ipar.thread);
-		if (rc != 0)
-			osaf_abort(rc);
-		rc = pthread_join(ipar.thread, &join_ret);
-		if (rc != 0)
-			osaf_abort(rc);
-		rc = pthread_mutex_destroy(&ntfimcn_mutex);
-		if (rc != 0)
-			osaf_abort(rc);
-
+		ipar.ha_state = 0;
 		TRACE("%s: Terminating osafntfimcnd process", __FUNCTION__);
 		timedwait_imcn_exit();
 	}
+	osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+
+	TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+	rc = pthread_cancel(ipar.thread);
+	if (rc != 0)
+		osaf_abort(rc);
+	rc = pthread_join(ipar.thread, &join_ret);
+	if (rc != 0)
+		osaf_abort(rc);
+	rc = pthread_mutex_destroy(&ntfimcn_mutex);
+	if (rc != 0)
+		osaf_abort(rc);
 
 	TRACE_LEAVE();
 	return rc;
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to