Hi Alexander,

On 2021/8/12 4:35, Alexander Aring wrote:
Hi,

On Wed, Aug 11, 2021 at 6:41 AM Gang He <g...@suse.com> wrote:

Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same problem).
When node A had acquired a dlm (EX) lock and node B tried to get the same lock, node A
got a BAST message.
Node A then downconverted the dlm lock to NL, and the dlm_lock function failed with
error -16.
The failure did not always happen, but in some cases I could hit it.
Why does the dlm_lock function fail when downconverting a dlm lock? Are there any
documents that describe these error cases?
If the code on node A ignores the error returned by dlm_lock, node B will never get the
dlm lock.
How should we handle such a situation? Should we call the dlm_lock function again to
downconvert the dlm lock?

What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?
ocfs2 file system.


I believe you are running into case [0]. Can you provide the
corresponding log_debug() message? You need to set
"log_debug=1" in your dlm.conf; the messages are then reported at
KERN_DEBUG level in your kernel log.
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 100000
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 100000 10c 2 0 M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M0000000000000000046e0200000000
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap

The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The kernel logs of the other two nodes:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem reliably.
I want to know whether this error is expected, since no extreme stress test is running. Second, how should we handle these error cases? Should we call the dlm_lock function again? The function may fail again, which will lead to a kernel soft lockup after multiple retries.

Thanks
Gang


Thanks.

- Alex

[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886

