Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?

2021-08-16, David Teigland
On Mon, Aug 16, 2021 at 09:41:18AM -0500, David Teigland wrote:
> On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> > Hi David,
> > 
> > On 2021/8/13 1:45, David Teigland wrote:
> > > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > > In fact, I can reproduce this problem reliably.
> > > > I want to know whether this error is expected, since this is not an
> > > > extreme stress test.
> > > > Second, how should we handle these error cases? Call dlm_lock again?
> > > > The function may fail again, which could lead to a kernel soft
> > > > lockup after multiple retries.
> > > 
> > > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > > an in-progress dlm_lock() request.  Before the cancel completes (or the
> > > original request completes), ocfs2 calls dlm_lock() again on the same
> > > resource.  This dlm_lock() returns -EBUSY because the previous request has
> > > not completed, either normally or by cancellation.  This is expected.
> > Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
> > different nodes?
> 
> different

Sorry, same node



Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?

2021-08-16, David Teigland
On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> Hi David,
> 
> On 2021/8/13 1:45, David Teigland wrote:
> > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > In fact, I can reproduce this problem reliably.
> > > I want to know whether this error is expected, since this is not an
> > > extreme stress test.
> > > Second, how should we handle these error cases? Call dlm_lock again?
> > > The function may fail again, which could lead to a kernel soft lockup
> > > after multiple retries.
> > 
> > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > an in-progress dlm_lock() request.  Before the cancel completes (or the
> > original request completes), ocfs2 calls dlm_lock() again on the same
> > resource.  This dlm_lock() returns -EBUSY because the previous request has
> > not completed, either normally or by cancellation.  This is expected.
> Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
> different nodes?

different

> > A couple of options to try: wait for the original request to complete
> > (normally or by cancellation) before calling dlm_lock() again, or retry
> > dlm_lock() on -EBUSY.
> If I retry dlm_lock() repeatedly, I wonder whether this will lead to a
> kernel soft lockup or waste a lot of CPU.

I'm not aware of other code doing this, so I can't tell you with certainty.
It would depend largely on the implementation in the caller.

> If the dlm_lock() function returns -EAGAIN, how should we handle this
> case? Retry it repeatedly?

Again, this is more a question about the implementation of the calling
code and what it wants to do.  EAGAIN is specifically related to the
DLM_LKF_NOQUEUE flag.
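
As a sketch of what NOQUEUE-based trylock semantics can look like in a
kernel caller (illustrative only; the lockspace, resource name, and
callbacks below are assumptions, not code from this thread):

#include <linux/dlm.h>
#include <linux/string.h>

static void try_ast(void *astarg)
{
        struct dlm_lksb *lksb = astarg;

        /* with DLM_LKF_NOQUEUE, a request that cannot be granted
         * immediately is not queued behind other waiters; it fails
         * with -EAGAIN, which typically arrives here in sb_status */
        if (lksb->sb_status == -EAGAIN) {
                /* resource busy: retry later, re-request without
                 * NOQUEUE, or give up */
        }
}

static void try_bast(void *astarg, int mode)
{
        /* another node is blocked on this lock */
}

static int trylock_ex(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
                      const char *name)
{
        return dlm_lock(ls, DLM_LOCK_EX, lksb, DLM_LKF_NOQUEUE,
                        (void *)name, strlen(name), 0,
                        try_ast, lksb, try_bast);
}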



Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?

2021-08-13, Gang He

Hi David,

On 2021/8/13 1:45, David Teigland wrote:
> On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > In fact, I can reproduce this problem reliably.
> > I want to know whether this error is expected, since this is not an
> > extreme stress test.
> > Second, how should we handle these error cases? Call dlm_lock again?
> > The function may fail again, which could lead to a kernel soft lockup
> > after multiple retries.
> 
> What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> an in-progress dlm_lock() request.  Before the cancel completes (or the
> original request completes), ocfs2 calls dlm_lock() again on the same
> resource.  This dlm_lock() returns -EBUSY because the previous request has
> not completed, either normally or by cancellation.  This is expected.
Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
different nodes?

> A couple of options to try: wait for the original request to complete
> (normally or by cancellation) before calling dlm_lock() again, or retry
> dlm_lock() on -EBUSY.
If I retry dlm_lock() repeatedly, I wonder whether this will lead to a
kernel soft lockup or waste a lot of CPU.

If the dlm_lock() function returns -EAGAIN, how should we handle this
case? Retry it repeatedly?

Thanks
Gang

> Dave





Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?

2021-08-12, David Teigland
On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> In fact, I can reproduce this problem reliably.
> I want to know whether this error is expected, since this is not an
> extreme stress test.
> Second, how should we handle these error cases? Call dlm_lock again?
> The function may fail again, which could lead to a kernel soft lockup
> after multiple retries.

What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
an in-progress dlm_lock() request.  Before the cancel completes (or the
original request completes), ocfs2 calls dlm_lock() again on the same
resource.  This dlm_lock() returns -EBUSY because the previous request has
not completed, either normally or by cancellation.  This is expected.
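
The cancel referred to here is a dlm_unlock() call with the DLM_LKF_CANCEL
flag, roughly like this (a sketch, with the lockspace and lksb assumed to
exist in the caller):

#include <linux/dlm.h>

/* Ask the DLM to cancel an in-flight request or conversion.  The
 * outcome arrives through the lock's completion ast: sb_status is
 * -DLM_ECANCEL if the cancel won, or the normal result if the
 * original request completed first. */
static int cancel_request(dlm_lockspace_t *ls, struct dlm_lksb *lksb)
{
        return dlm_unlock(ls, lksb->sb_lkid, DLM_LKF_CANCEL, lksb, lksb);
}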

A couple of options to try: wait for the original request to complete
(normally or by cancellation) before calling dlm_lock() again, or retry
dlm_lock() on -EBUSY.

Dave
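
To illustrate the retry option, a loop like the following could be used by
the caller (a minimal sketch; the lockspace, lksb, callbacks, and the 10 ms
backoff are assumptions, and sleeping between attempts is what avoids the
soft-lockup concern raised above):

#include <linux/delay.h>
#include <linux/dlm.h>
#include <linux/errno.h>
#include <linux/string.h>

static int convert_with_retry(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
                              int mode, const char *name,
                              void (*ast)(void *astarg), void *astarg,
                              void (*bast)(void *astarg, int bmode))
{
        int ret;

        for (;;) {
                ret = dlm_lock(ls, mode, lksb, DLM_LKF_CONVERT,
                               (void *)name, strlen(name), 0,
                               ast, astarg, bast);
                if (ret != -EBUSY)
                        return ret;
                /* the previous request (or its cancel) on this lock has
                 * not completed yet; back off instead of spinning */
                msleep(10);
        }
}

The first option, waiting, would instead defer the new dlm_lock() call
until the completion ast for the outstanding request has fired.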



Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?

2021-08-12, Gang He

Hi Alexander,


On 2021/8/12 4:35, Alexander Aring wrote:
> Hi,
> 
> On Wed, Aug 11, 2021 at 6:41 AM Gang He  wrote:
> >
> > Hello List,
> >
> > I am using kernel 5.13.4 (some older kernel versions have the same
> > problem).
> > When node A acquired a DLM (EX) lock and node B tried to get the lock,
> > node A got a BAST message; then node A downconverted the DLM lock to
> > NL, and the dlm_lock function failed with the error -16.
> > The function failure did not always happen, but in some cases I could
> > encounter this failure.
> > Why does the dlm_lock function fail when downconverting a DLM lock?
> > Are there any documents describing these error cases?
> > If the code ignores the dlm_lock error returned on node A, node B will
> > never get the DLM lock.
> > How should we handle this situation? Call dlm_lock to downconvert the
> > lock again?
> 
> What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?
The ocfs2 file system.

> I believe you are running into case [0]. Can you provide the
> corresponding log_debug() message? You will need to insert
> "log_debug=1" into your dlm.conf; the messages are then reported at the
> KERN_DEBUG level in your kernel log.
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 10 10c 2 0 M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap

The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The kernel logs for the other two nodes:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem reliably.
I want to know whether this error is expected, since this is not an
extreme stress test.
Second, how should we handle these error cases? Call dlm_lock again?
The function may fail again, which could lead to a kernel soft lockup
after multiple retries.

Thanks
Gang

> Thanks.
> 
> - Alex
> 
> [0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886





Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?

2021-08-11, Alexander Aring
Hi,

On Wed, Aug 11, 2021 at 6:41 AM Gang He  wrote:
>
> Hello List,
>
> I am using kernel 5.13.4 (some older kernel versions have the same
> problem).
> When node A acquired a DLM (EX) lock and node B tried to get the lock,
> node A got a BAST message; then node A downconverted the DLM lock to NL,
> and the dlm_lock function failed with the error -16.
> The function failure did not always happen, but in some cases I could
> encounter this failure.
> Why does the dlm_lock function fail when downconverting a DLM lock? Are
> there any documents describing these error cases?
> If the code ignores the dlm_lock error returned on node A, node B will
> never get the DLM lock.
> How should we handle this situation? Call dlm_lock to downconvert the
> lock again?

What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?

I believe you are running into case [0]. Can you provide the
corresponding log_debug() message? You will need to insert
"log_debug=1" into your dlm.conf; the messages are then reported at the
KERN_DEBUG level in your kernel log.

Thanks.

- Alex

[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
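
For reference, the configuration change described above is a single line
(the dlm.conf path varies by distribution; /etc/dlm/dlm.conf is common):

# /etc/dlm/dlm.conf: log DLM debug messages at KERN_DEBUG
log_debug=1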



[Cluster-devel] Why does the dlm_lock function fail when downconverting a DLM lock?

2021-08-11, Gang He
Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same problem).
When node A acquired a DLM (EX) lock and node B tried to get the lock,
node A got a BAST message; then node A downconverted the DLM lock to NL,
and the dlm_lock function failed with the error -16.
The function failure did not always happen, but in some cases I could
encounter this failure.
Why does the dlm_lock function fail when downconverting a DLM lock? Are
there any documents describing these error cases?
If the code ignores the dlm_lock error returned on node A, node B will
never get the DLM lock.
How should we handle this situation? Call dlm_lock to downconvert the
lock again?

Thanks
Gang
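
For context, the downconvert described above is a converting dlm_lock()
call that requests a lower mode on an existing lock. A minimal sketch
(the lockspace, lksb, name, and callbacks are assumptions; the lksb must
still carry the lock id from the original request):

#include <linux/dlm.h>
#include <linux/string.h>

/* Convert an existing EX lock down to NL.  lksb->sb_lkid must be the
 * id returned by the original dlm_lock() request; the -16 discussed in
 * this thread is -EBUSY returned by such a convert call. */
static int downconvert_to_nl(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
                             const char *name,
                             void (*ast)(void *astarg), void *astarg,
                             void (*bast)(void *astarg, int mode))
{
        return dlm_lock(ls, DLM_LOCK_NL, lksb, DLM_LKF_CONVERT,
                        (void *)name, strlen(name), 0,
                        ast, astarg, bast);
}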