We saw an issue in a production server on a customer deployment where DLM 4.0.7 gets "stuck" and unable to join new lockspaces.
See - https://lists.clusterlabs.org/pipermail/users/2019-January/016054.html This was forwarded off list to David Teigland who responded thusly. " Hi, thanks for the debugging info. You've spent more time looking at this than I have, but from a first glance it seems to me that the initial problem (there may be multiple) is that in the kernel, lockspace.c do_event() does not sensibly handle the ERESTARTSYS error from wait_event_interruptible(). I think do_event() should continue waiting for a uevent result from userspace until it gets one, because the kernel can't do anything sensible until it gets that. Dave " The previous attempt at fixing this was NAKed by Linus since it could cause a busy-wait loop. Instead, just switch wait_event_interruptible() to wait_event(). Signed-off-by: Ross Lagerwall <[email protected]> --- fs/dlm/lockspace.c | 18 ++++-------------- 1 file changed, 4 insertions(+), 14 deletions(-) diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c index afb8340918b8..e93670ecfae5 100644 --- a/fs/dlm/lockspace.c +++ b/fs/dlm/lockspace.c @@ -197,8 +197,6 @@ static struct kset *dlm_kset; static int do_uevent(struct dlm_ls *ls, int in) { - int error; - if (in) kobject_uevent(&ls->ls_kobj, KOBJ_ONLINE); else @@ -209,20 +207,12 @@ static int do_uevent(struct dlm_ls *ls, int in) /* dlm_controld will see the uevent, do the necessary group management and then write to sysfs to wake us */ - error = wait_event_interruptible(ls->ls_uevent_wait, - test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags)); + wait_event(ls->ls_uevent_wait, + test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags)); - log_rinfo(ls, "group event done %d %d", error, ls->ls_uevent_result); - - if (error) - goto out; + log_rinfo(ls, "group event done %d", ls->ls_uevent_result); - error = ls->ls_uevent_result; - out: - if (error) - log_error(ls, "group %s failed %d %d", in ? "join" : "leave", - error, ls->ls_uevent_result); - return error; + return ls->ls_uevent_result; } static int dlm_uevent(struct kset *kset, struct kobject *kobj, -- 2.21.1
