Hi Hans,
The problem happened at the first place was without trace, I'm guessing
because of LOG_ER, where write() get hit and the error string might not
have been printed
static pid_t create_imcnprocess(SaAmfHAStateT ha_state) {
...
i_pid = fork();
if (i_pid == -1) {
* LOG_ER("Failed to fork %s", strerror(errno));*
abort();
} else if (i_pid == 0) {
...
}
The main thread is getting SIG_TERM and its child thread is doing fork()
I have sent another patch to prevent this happen, I think Lennart is
also checking the idea
If join_ret is not used, NULL should be passed, otherwise it should
check against PTHREAD_CANCELED.
Thanks!
Minh
On 27/06/17 17:23, Hans Nordebäck wrote:
Hi Minh,
Is trace enabled when this problem occurs? If so trace adds cancellation
points, e.g. at fork/exec in create_imcnprocess.
In stop_ntfimcn, pthread_join the join_ret variable is not used, use NULL
instead, looking at the backtrace in the ticket, it's value is -1.
/BR Hans
-----Original Message-----
From: minh chau [mailto:[email protected]]
Sent: den 23 juni 2017 07:42
To: Lennart Lund <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: [devel] [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread is canceled [#2508]
Hi Lennart,
Using ASYNCHRONOUS in this thread does not sound that bad, since the thread
does not deal with IO, allocation/deallocation, ...
By the way, is the patch below what your suggestion if I understand it right?
Thanks,
Minh
diff --git a/src/ntf/ntfd/ntfs_imcnutil.c b/src/ntf/ntfd/ntfs_imcnutil.c index
dd27a255c..7f817075f 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -220,6 +220,20 @@ static pid_t create_imcnprocess(SaAmfHAStateT ha_state)
TRACE_LEAVE();
return i_pid;
}
+/*
+ * Cleanup handler for cnsurvail thread
+ * @param: arg[in]: a mutex
+ */
+void cnsurvail_cleanup_handler(void* arg) {
+ pthread_mutex_t* mutex = (pthread_mutex_t*)(arg);
+ TRACE_ENTER();
+ if (mutex) {
+ TRACE("osaf_mutex_unlock_ordie");
+ osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+ }
+ TRACE_LEAVE();
+}
/**
* Thread:
@@ -240,9 +254,10 @@ static void *cnsurvail_thread(void *_init_params)
while (1) {
osaf_mutex_lock_ordie(&ntfimcn_mutex);
+ pthread_cleanup_push(cnsurvail_cleanup_handler, &ntfimcn_mutex);
pid = create_imcnprocess(ipar->ha_state);
ipar->pid = pid;
- osaf_mutex_unlock_ordie(&ntfimcn_mutex);
+ pthread_cleanup_pop(1);
/* Wait for child process to exit */
do {
@@ -344,7 +359,9 @@ int stop_ntfimcn(void)
if (ipar.ha_state != 0) {
TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+ osaf_mutex_lock_ordie(&ntfimcn_mutex);
rc = pthread_cancel(ipar.thread);
+ osaf_mutex_unlock_ordie(&ntfimcn_mutex);
if (rc != 0)
osaf_abort(rc);
rc = pthread_join(ipar.thread, &join_ret);
On 22/06/17 18:57, Lennart Lund wrote:
Hi Minh
This solution will probably work quite well but there is still a minor
risk that the mutex will be taken before the thread is cancelled and there is also a risk
(see Note2) "Asynchronous cancelability means that the thread can be canceled at any
time (usually immediately, but the system does not guarantee this)"
Instead:
In the thread a cancellation function that unlocks the mutex can be pushed just
after the mutex is locked. This function takes the mutex as in-parameter and its
only function is to unlock the mutex. Instead of unlocking the mutex by calling
osaf_mutex_unlock_ordie(&ntfimcn_mutex) as in the current code the cancellation
function is popped and executed (it is unlocking the mutex) See
pthread_cleanup_push() and pthread_cleanup_pop(). With this solution the
cancellation type should not be changed to ASYNCHORNOUS it shall remain
PTHREAD_CANCEL_DEFERRED. This will guarantee that the thread will never be
terminated with the mutex locked.
Note1: Also with this solution the mutex shall be taken in the main thread
before rc = pthread_cancel(ipar.thread); and released just after.
Note2: I found that changing to cancellation type ASYNCHORNOUS is not safe
because only a few are async-cancel-safe and for example waitpid() is not one
of the safe fuctions!
Thanks
Lennart
-----Original Message-----
From: Minh Chau [mailto:[email protected]]
Sent: den 22 juni 2017 06:14
To: Lennart Lund <[email protected]>;
[email protected]
Cc: [email protected]; Minh Hon Chau
<[email protected]>
Subject: [PATCH 1/1] ntfd: Ensure mutex is not taken after
cnsurvail_thread is canceled [#2508]
In the scenario of shutting down SC while SC switchover is on going,
ntfd coredump is generated due to failure of pthread_mutex_destroy()
with errorcode:16(EBUSY). That means the mutex had been taken and was
not unlocked at the time phtread_mutex_destroy() is called.
One solution is adding mutex protection for pthread_cancel() so that
there's no cancellation request if cnsurvail_thread() is taking
mutex, or cnsurvail_thread() can not take mutex if the thread
cancellation request is issued. That also needs the cnsurvail_thread
to have the cancellation type as ASYNCHORNOUS. Otherwise the same
coredump issue still occurs since the cancellation request is
deffered (cancellation type as PTHREAD_CANCEL_DEFERRED set by
default)
---
src/ntf/ntfd/ntfs_imcnutil.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/src/ntf/ntfd/ntfs_imcnutil.c
b/src/ntf/ntfd/ntfs_imcnutil.c index dd27a255c..672a3910e 100644
--- a/src/ntf/ntfd/ntfs_imcnutil.c
+++ b/src/ntf/ntfd/ntfs_imcnutil.c
@@ -238,6 +238,10 @@ static void *cnsurvail_thread(void
*_init_params)
TRACE_ENTER();
+ pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &status);
+ pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,
&status);
+ TRACE("old cancellation type: %d", status);
+
while (1) {
osaf_mutex_lock_ordie(&ntfimcn_mutex);
pid = create_imcnprocess(ipar->ha_state);
@@ -344,7 +348,9 @@ int stop_ntfimcn(void)
if (ipar.ha_state != 0) {
TRACE("%s: Cancel the imcn surveillance thread", __FUNCTION__);
+ osaf_mutex_lock_ordie(&ntfimcn_mutex);
rc = pthread_cancel(ipar.thread);
+ osaf_mutex_unlock_ordie(&ntfimcn_mutex);
if (rc != 0)
osaf_abort(rc);
rc = pthread_join(ipar.thread, &join_ret);
--
2.11.0
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech
sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel