Hi Lennart,

I will add the comments and fix naming in V2 patch.

Thanks,
Minh

On 27/06/17 20:09, Lennart Lund wrote:
Hi Minh

See my comment inline [Lennart]

Thanks
Lennart

-----Original Message-----
From: minh chau [mailto:[email protected]]
Sent: den 27 juni 2017 07:28
To: Lennart Lund <[email protected]>;
[email protected]
Cc: [email protected]; Anders Widell
<[email protected]>
Subject: Re: [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread is canceled [#2508]

Hi Lennart,

Please see my comments inline [Minh]

Thanks,
Minh

On 27/06/17 00:16, Lennart Lund wrote:
Hi Minh

Yes, this is what I suggested but see my comments inline [Lennart]

It is not a good idea to use ASYNCHRONOUS since some functions used in
the thread is not allowed if this is the cancellation type. See "Async-cancel-
safe functions" in pthreads section in Linux man pages.
[Minh] So we need to re-ensure ASYNCHRONOUS is also not used in
somewhere else in our code.
However, PTHREAD_CANCEL_DEFERRED do create a potential problem. If
the cancellation request is given when the thread is no longer waiting in
waitpid() but before it has taken the mutex the thread will then reach
create_imcnprocess() that will create the imcnprocess. To be able to kill the
process the Pid is needed and for that purpose the Pid saved in the ipar
structure is used. However the pid is not saved until after the
create_imcnprocess() returns. If for some reason the thread is cancelled
after the process is created but before the Pid is saved the process will not
be possible to kill (we have no Pid). The current code does not have a
cancellation point between the fork() and saving of the pid so this should not
happen but you never know what will happen to the code in the future so I
suggest that pthread_testcancel() is called just before fork() is called.
[Minh]: I think this ticket is originated from 2 problems:
1. Stopping surveillance should be in reverse order of starting it. That
means, start sequence currently is: ntfd mainthread ->
cnsurvail_thread() -> ntfimcnd. However, stop sequence is:
cnsurvail_thread -> ntfimcnd -> ntfd_mainthread. So that should be the
reason of the above problem you mentioned, where we don't Pid to kill
ntfimcnd
2. In stopping sequence from top down view of surveillance, the main
thread is trying to stop cnsurvail_thread, while the cnsurvail_thread is
going to start ntfimcnd. This is causing trouble and there should be
someway to get cnsurvail_thread being aware that it does not start
ntfimcnd in stopping sequence.

Attaching a simple patch that solves 2 above problems, so we don't have
to add too much threading stuffs for this ticket
[Lennart] The solution in your patch seems to do the job.
However it is cryptic and hard to understand. This could be fixed by adding 
simple comments in both the thread and stop functions. Re-using the ha_state 
flag here for a completely different purpose than it is intended for and named 
to inform about is most confusing and misleading!
As I understand the original problem is that the ntfd process is aborted
because it is trying to destroy the ntfimcn_mutex while it is locked by the
thread but why is this happening?
As far as I can see there is no cancellation point between where the mutex
is locked and unlocked and the cancellation type is
PTHREAD_CANCEL_DEFERRED. The only cancellation point in the thread is the
waitpid().
[Minh]: This problem happened I think because one of other cancellation
points inside log/trace get called, where create_imcnprocess() does
LOG_ER/TRACE function.
Thanks
Lennart

-----Original Message-----
From: minh chau [mailto:[email protected]]
Sent: den 23 juni 2017 07:42
To: Lennart Lund<[email protected]>;
[email protected]
Cc:[email protected]; Anders Widell
<[email protected]>
Subject: Re: [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread is canceled [#2508]

Hi Lennart,

Using ASYNCHRONOUS in this thread does not sound that bad, since the
thread does not deal with IO, allocation/deallocation, ...
By the way, is the patch below what your suggestion if I understand it
right?

Thanks,
Minh

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c
index dd27a255c..7f817075f 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT
ha_state)
        TRACE_LEAVE();
        return i_pid;
    }
+/*
+ * Cleanup handler for cnsurvail thread
+ * @param: arg[in]: a mutex
+ */
+void cnsurvail_cleanup_handler(void* arg)
+{
+  pthread_mutex_t* mutex = (pthread_mutex_t*)(arg);
+  TRACE_ENTER();
+  if (mutex) {
+    TRACE("osaf_mutex_unlock_ordie");
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
[Lennart] This seems incorrect. Should be: mutex_unlock_ordie(mutex);
+  }
+  TRACE_LEAVE();
+}

    /**
     * Thread:
@@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params)

        while (1) {
            osaf_mutex_lock_ordie(&ntfimcn_mutex);
+    pthread_cleanup_push(cnsurvail_cleanup_handler,
&ntfimcn_mutex);
            pid = create_imcnprocess(ipar->ha_state);
            ipar->pid = pid;
-        osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+        pthread_cleanup_pop(1);

            /* Wait for child process to exit */
            do {
@@ -344,7 +359,9 @@ int stop_ntfimcn(void)

        if (ipar.ha_state != 0) {
            TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+    osaf_mutex_lock_ordie(&ntfimcn_mutex);
            rc = pthread_cancel(ipar.thread);
+    osaf_mutex_unlock_ordie(&ntfimcn_mutex);
            if (rc != 0)
                osaf_abort(rc);
            rc = pthread_join(ipar.thread, &join_ret);

On 22/06/17 18:57, Lennart Lund wrote:
Hi Minh

This solution will probably work quite well but there is still a minor risk
that
the mutex will be taken before the thread is cancelled and there is also a
risk
(see Note2)
"Asynchronous cancelability means that the thread can be canceled at
any
time (usually immediately, but the system does not guarantee this)"
Instead:
In the thread a cancellation function that unlocks the mutex can be
pushed
just after the mutex is locked. This function takes the mutex as in-
parameter
and its only function is to unlock the mutex. Instead of unlocking the
mutex
by calling osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current
code
the cancellation function is popped and executed (it is unlocking the
mutex)
See pthread_cleanup_push() and pthread_cleanup_pop(). With this
solution
the cancellation type should not be changed to ASYNCHORNOUS it shall
remain PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread
will
never be terminated with the mutex locked.
Note1: Also with this solution the mutex shall be taken in the main
thread
before rc = pthread_cancel(ipar.thread); and released just after.
Note2: I found that changing to cancellation type ASYNCHORNOUS is not
safe because only a few are async-cancel-safe and for example waitpid()
is
not one of the safe fuctions!
Thanks
Lennart

-----Original Message-----
From: Minh Chau [mailto:[email protected]]
Sent: den 22 juni 2017 06:14
To: Lennart Lund<[email protected]>;
[email protected]
Cc:[email protected]; Minh Hon Chau
<[email protected]>
Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread
is canceled [#2508]

In the scenario of shutting down SC while SC switchover is on going,
ntfd coredump is generated due to failure of
pthread_mutex_destroy()
with errorcode:16(EBUSY). That means the mutex had been taken and
was not unlocked at the time phtread_mutex_destroy() is called.

One solution is adding mutex protection for pthread_cancel() so that
there's no cancellation request if cnsurvail_thread() is taking mutex,
or cnsurvail_thread() can not take mutex if the thread cancellation
request is issued. That also needs the cnsurvail_thread to have the
cancellation type as ASYNCHORNOUS. Otherwise the same coredump
issue
still occurs since the cancellation request is deffered (cancellation
type as PTHREAD_CANCEL_DEFERRED set by default)
---
    src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++
    1 file changed, 6 insertions(+)

diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c
index dd27a255c..672a3910e 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -238,6 +238,10 @@ static void *cnsurvail_thread(void
*_init_params)
        TRACE_ENTER();

+       pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status);
+       pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,
&status);
+       TRACE("old cancellation type: %d", status);
+
        while (1) {
                osaf_mutex_lock_ordie(&ntfimcn_mutex);
                pid = create_imcnprocess(ipar->ha_state);
@@ -344,7 +348,9 @@ int stop_ntfimcn(void)

        if (ipar.ha_state != 0) {
                TRACE("%s: Cancel the imcn surveillance thread",
__FUNCTION__);
+               osaf_mutex_lock_ordie(&ntfimcn_mutex);
                rc = pthread_cancel(ipar.thread);
+               osaf_mutex_unlock_ordie(&ntfimcn_mutex);
                if (rc != 0)
                        osaf_abort(rc);
                rc = pthread_join(ipar.thread, &join_ret);
--
2.11.0



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to