We saw an issue on a production server in a customer deployment where
DLM 4.0.7 gets "stuck" and is unable to join new lockspaces.

See - https://lists.clusterlabs.org/pipermail/users/2019-January/016054.html

This was forwarded off-list to David Teigland, who responded as follows:

"
Hi, thanks for the debugging info.  You've spent more time looking at
this than I have, but from a first glance it seems to me that the
initial problem (there may be multiple) is that in the kernel,
lockspace.c do_event() does not sensibly handle the ERESTARTSYS error
from wait_event_interruptible().  I think do_event() should continue
waiting for a uevent result from userspace until it gets one, because
the kernel can't do anything sensible until it gets that.

Dave
"

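The previous attempt at fixing this was NAKed by Linus since it could
cause a busy-wait loop. Instead, just switch wait_event_interruptible()
to wait_event(), so that a signal can no longer abort the wait with
ERESTARTSYS and do_uevent() always returns the result written by
userspace.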

Signed-off-by: Ross Lagerwall <[email protected]>
---
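For readers less familiar with the interruptible/uninterruptible
distinction, here is a rough userspace analogy of the change described
above. It is not kernel code. It assumes that read() on a pipe stands in
for the wait on ls_uevent_wait, that EINTR stands in for ERESTARTSYS, and
that blocking SIGALRM with sigprocmask() stands in for the uninterruptible
sleep of wait_event(). All function names and the result value 7 are made
up for illustration.

```c
/* Userspace analogy only; every name below is hypothetical. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; } /* exists only to interrupt read() */

/* Deliver SIGALRM every 10 ms (on != 0) or stop the timer (on == 0). */
static void set_signals(int on)
{
	struct sigaction sa;
	struct itimerval tv;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;	/* no SA_RESTART, so read() fails with EINTR */
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, NULL);
	memset(&tv, 0, sizeof(tv));
	tv.it_value.tv_usec = tv.it_interval.tv_usec = on ? 10000 : 0;
	setitimer(ITIMER_REAL, &tv, NULL);
}

/* Old shape: a signal aborts the wait before any result has arrived. */
static int wait_result_interruptible(int fd)
{
	int result = 0;
	ssize_t n = read(fd, &result, sizeof(result));

	if (n < 0)
		return -errno;
	return n == (ssize_t)sizeof(result) ? result : -EIO;
}

/* New shape: signals are held off until the result has been read. */
static int wait_result_uninterruptible(int fd)
{
	sigset_t block, old;
	int result = 0;
	ssize_t n;

	sigemptyset(&block);
	sigaddset(&block, SIGALRM);
	sigprocmask(SIG_BLOCK, &block, &old);
	n = read(fd, &result, sizeof(result));
	sigprocmask(SIG_SETMASK, &old, NULL);
	return n == (ssize_t)sizeof(result) ? result : -EIO;
}

/* Nobody ever writes a result, so the periodic signal ends the wait. */
int demo_interruptible(void)
{
	int fds[2], ret;

	if (pipe(fds) < 0)
		return -errno;
	set_signals(1);
	ret = wait_result_interruptible(fds[0]);
	set_signals(0);
	close(fds[0]);
	close(fds[1]);
	return ret;
}

/* A child plays the userspace daemon and reports result 7 after 50 ms. */
int demo_uninterruptible(void)
{
	struct timespec delay = { 0, 50 * 1000 * 1000 };
	int fds[2], ret, result = 7;
	pid_t pid;

	if (pipe(fds) < 0)
		return -errno;
	pid = fork();
	if (pid < 0)
		return -errno;
	if (pid == 0) {
		nanosleep(&delay, NULL);
		_exit(write(fds[1], &result, sizeof(result)) == sizeof(result) ? 0 : 1);
	}
	set_signals(1);
	ret = wait_result_uninterruptible(fds[0]);
	set_signals(0);
	while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
		;
	close(fds[0]);
	close(fds[1]);
	return ret;
}

/* Usage: the first wait is cut short, the second sees the real result. */
void run_demos(void)
{
	assert(demo_interruptible() == -EINTR);
	assert(demo_uninterruptible() == 7);
}
```

In the interruptible variant the caller gets an error without ever having
seen a result, which is the userspace counterpart of the failure mode
described in the quoted mail. The analogy is loose: the kernel's
wait_event() sleeps in an uninterruptible state rather than masking
signals.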
 fs/dlm/lockspace.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index afb8340918b8..e93670ecfae5 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -197,8 +197,6 @@ static struct kset *dlm_kset;
 
 static int do_uevent(struct dlm_ls *ls, int in)
 {
-       int error;
-
        if (in)
                kobject_uevent(&ls->ls_kobj, KOBJ_ONLINE);
        else
@@ -209,20 +207,12 @@ static int do_uevent(struct dlm_ls *ls, int in)
        /* dlm_controld will see the uevent, do the necessary group management
           and then write to sysfs to wake us */
 
-       error = wait_event_interruptible(ls->ls_uevent_wait,
-                       test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
+       wait_event(ls->ls_uevent_wait,
+                  test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags));
 
-       log_rinfo(ls, "group event done %d %d", error, ls->ls_uevent_result);
-
-       if (error)
-               goto out;
+       log_rinfo(ls, "group event done %d", ls->ls_uevent_result);
 
-       error = ls->ls_uevent_result;
- out:
-       if (error)
-               log_error(ls, "group %s failed %d %d", in ? "join" : "leave",
-                         error, ls->ls_uevent_result);
-       return error;
+       return ls->ls_uevent_result;
 }
 
 static int dlm_uevent(struct kset *kset, struct kobject *kobj,
-- 
2.21.1

