Re: [Cluster-devel] FS/DLM module triggered kernel BUG

2021-08-24 Thread Gang He




On 2021/8/23 21:49, Alexander Aring wrote:

Hi Gang He,

On Mon, Aug 23, 2021 at 1:43 AM Gang He  wrote:


Hello Guys,

I am using kernel 5.13.8, and I sometimes encounter a kernel BUG triggered by the
dlm module.


What exactly do you do? I would like to test it on a recent upstream
version, or can you do it?

I am not specifically testing the dlm kernel module.
I am doing ocfs2-related testing with openSUSE Tumbleweed, which
includes a very new kernel version.
But sometimes the ocfs2 test cases were blocked/aborted due to this DLM
problem.





Since the dlm kernel module I am running is not built from the latest source
code, I am not sure whether this problem has already been fixed.



could be, see below.


The backtrace is as below,

[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 
172204615
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_members 2 nodes
[Fri Aug 20 16:24:14 2021] dlm: connection 5ef82293 got EOF from 
172204615


here we disconnect from nodeid 172204615.


[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 
slots 2 1:172204786 2:172204748
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_directory
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_directory 8 in 1 new
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_directory 1 out 1 messages
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_masters
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_masters 33587 of 33599
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_locks 0 out
[Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: 
dd99ffd16650
[Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
[Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
[Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
[Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
[Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G  
  W 5.13.8-1-default #1 openSUSE Tumbleweed
[Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
[Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
[Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 
1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 
48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
[Fri Aug 20 16:24:14 2021] RSP: 0018:bd9a041ebd80 EFLAGS: 00010282
[Fri Aug 20 16:24:14 2021] RAX: 3cc9c100ec00 RBX: 00dc RCX: 
0830
[Fri Aug 20 16:24:14 2021] RDX:  RSI: 0f48 RDI: 
c06b4420
[Fri Aug 20 16:24:14 2021] RBP: a0d028423974 R08: 0001 R09: 
0004
[Fri Aug 20 16:24:14 2021] R10:  R11:  R12: 
a0d028425000
[Fri Aug 20 16:24:14 2021] R13: 0a43a2f2 R14: a0d028425770 R15: 
0a43a2f2
[Fri Aug 20 16:24:14 2021] FS:  () 
GS:a0d03ed0() knlGS:
[Fri Aug 20 16:24:14 2021] CS:  0010 DS:  ES:  CR0: 80050033
[Fri Aug 20 16:24:14 2021] CR2: dd99ffd16650 CR3: 02696000 CR4: 
000406e0
[Fri Aug 20 16:24:14 2021] Call Trace:
[Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]


It would be interesting to know whether the message received here came from
nodeid 172204615, and I think that is what happens. There may be a
use-after-free going on; we should not receive any more messages from
nodeid 172204615.
I recently added some dlm tracing infrastructure. It should be simple
to add a trace event here, print out the nodeid and compare
timestamps.
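
Until such a trace event exists, a hypothetical one-line debug aid could be
dropped at the top of dlm_receive_buffer() (only an illustration; it assumes
dlm_receive_buffer() is passed the sending nodeid, and trace_printk() output
lands in the ftrace buffer with timestamps that can be compared against the
disconnect above):

	/* hypothetical debug aid: record which nodeid each received
	 * message claims to come from, with an ftrace timestamp */
	trace_printk("dlm_receive_buffer: message from nodeid %d\n", nodeid);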

I recently fixed a synchronization issue which is not part of kernel
5.13.8 and has something to do with what you are seeing here.
There is a workaround, and also a simple test to see whether this really
affects you: create a dummy lockspace on all nodes so that we never actually
disconnect, and check whether you still run into this issue.
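
For reference, the dummy-lockspace workaround could be as small as the sketch
below, run on every node and left running (a minimal sketch assuming libdlm's
dlm_create_lockspace()/dlm_release_lockspace() API and root privileges; the
lockspace name "dummy" is arbitrary, and dlm_tool join/leave may serve the same
purpose where available):

#include <stdio.h>
#include <unistd.h>
#include <libdlm.h>

int main(void)
{
	/* creating (and never leaving) a lockspace keeps the inter-node
	 * dlm connections alive, so no disconnect/EOF ever happens */
	dlm_lshandle_t ls = dlm_create_lockspace("dummy", 0600);

	if (!ls) {
		perror("dlm_create_lockspace");
		return 1;
	}
	pause();			/* hold the lockspace open until killed */
	dlm_release_lockspace("dummy", ls, 1);
	return 0;
}

Link with -ldlm.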
What is this git commit? I do not want to see any kernel (warning) prints
from the DLM kernel module; sometimes DLM enters a stuck state after such a
print.
Since there were a few commits in the past weeks, I just wonder whether there
is a regression.


Thanks
Gang





[Fri Aug 20 16:24:14 2021]  dlm_process_incoming_buffer+0x38/0x90 [dlm]
[Fri Aug 20 16:24:14 2021]  receive_from_sock+0xd4/0x1f0 [dlm]
[Fri Aug 20 16:24:14 2021]  process_recv_sockets+0x1a/0x20 [dlm]
[Fri Aug 20 16:24:14 2021]  process_one_work+0x1df/0x370
[Fri Aug 20 16:24:14 2021]  worker_thread+0x50/0x400
[Fri Aug 20 16:24:14 2021]  ? process_on

[Cluster-devel] FS/DLM module triggered kernel BUG

2021-08-23 Thread Gang He
Hello Guys,

I am using kernel 5.13.8, and I sometimes encounter a kernel BUG triggered by the
dlm module.
Since the dlm kernel module I am running is not built from the latest source
code, I am not sure whether this problem has already been fixed.

The backtrace is as below,

[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 
172204615
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_members 2 nodes
[Fri Aug 20 16:24:14 2021] dlm: connection 5ef82293 got EOF from 
172204615
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 
slots 2 1:172204786 2:172204748
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_directory
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_directory 8 in 1 new
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_directory 1 out 1 messages
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_masters
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_masters 33587 of 33599
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: 
dlm_recover_locks 0 out
[Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: 
dd99ffd16650
[Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
[Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
[Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
[Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
[Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G  
  W 5.13.8-1-default #1 openSUSE Tumbleweed
[Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
[Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
[Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 
c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 
f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
[Fri Aug 20 16:24:14 2021] RSP: 0018:bd9a041ebd80 EFLAGS: 00010282
[Fri Aug 20 16:24:14 2021] RAX: 3cc9c100ec00 RBX: 00dc RCX: 
0830
[Fri Aug 20 16:24:14 2021] RDX:  RSI: 0f48 RDI: 
c06b4420
[Fri Aug 20 16:24:14 2021] RBP: a0d028423974 R08: 0001 R09: 
0004
[Fri Aug 20 16:24:14 2021] R10:  R11:  R12: 
a0d028425000
[Fri Aug 20 16:24:14 2021] R13: 0a43a2f2 R14: a0d028425770 R15: 
0a43a2f2
[Fri Aug 20 16:24:14 2021] FS:  () 
GS:a0d03ed0() knlGS:
[Fri Aug 20 16:24:14 2021] CS:  0010 DS:  ES:  CR0: 80050033
[Fri Aug 20 16:24:14 2021] CR2: dd99ffd16650 CR3: 02696000 CR4: 
000406e0
[Fri Aug 20 16:24:14 2021] Call Trace:
[Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]
[Fri Aug 20 16:24:14 2021]  dlm_process_incoming_buffer+0x38/0x90 [dlm]
[Fri Aug 20 16:24:14 2021]  receive_from_sock+0xd4/0x1f0 [dlm]
[Fri Aug 20 16:24:14 2021]  process_recv_sockets+0x1a/0x20 [dlm]
[Fri Aug 20 16:24:14 2021]  process_one_work+0x1df/0x370
[Fri Aug 20 16:24:14 2021]  worker_thread+0x50/0x400
[Fri Aug 20 16:24:14 2021]  ? process_one_work+0x370/0x370
[Fri Aug 20 16:24:14 2021]  kthread+0x127/0x150
[Fri Aug 20 16:24:14 2021]  ? set_kthread_struct+0x40/0x40
[Fri Aug 20 16:24:14 2021]  ret_from_fork+0x22/0x30
[Fri Aug 20 16:24:14 2021] Modules linked in: rdma_ucm ib_uverbs rdma_cm iw_cm 
ib_cm ib_core ocfs2_stack_user ocfs2 ocfs2_nodemanager ocfs2_stackglue 
quota_tree dlm af_packet iscsi_ibft iscsi_boot_sysfs rfkill intel_rapl_msr 
hid_generic intel_rapl_common usbhid virtio_net pcspkr joydev net_failover 
virtio_balloon i2c_piix4 failover tiny_power_button button fuse configfs 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ata_generic uhci_hcd ehci_pci 
ehci_hcd cirrus drm_kms_helper aesni_intel usbcore crypto_simd syscopyarea 
sysfillrect sysimgblt fb_sys_fops cec cryptd rc_core drm serio_raw i6300esb 
virtio_blk ata_piix floppy qemu_fw_cfg btrfs blake2b_generic libcrc32c 
crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua virtio_rng
[Fri Aug 20 16:24:14 2021] CR2: dd99ffd16650
[Fri Aug 20 16:24:14 2021] ---[ end trace 2ddfa38b9d824d93 ]---
[Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
[Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 
c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 
f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
[Fri Aug 20 16:24:14 2021] RSP: 0018:bd9a041ebd80 EFLAGS: 00010282
[Fri Aug 20 16:24:14 2021] RAX: 3cc9c100ec00 RBX: 00dc RCX: 

Re: [Cluster-devel] Why does dlm_lock function fails when downconvert a dlm lock?

2021-08-13 Thread Gang He

Hi David,

On 2021/8/13 1:45, David Teigland wrote:

On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:

In fact, I can reproduce this problem reliably.
I want to know whether this error is expected, since there is no extreme
pressure test involved.
Second, how should we handle these error cases? Call the dlm_lock function
again? The function may fail again, which could lead to a kernel soft lockup
after multiple retries.


What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
an in-progress dlm_lock() request.  Before the cancel completes (or the
original request completes), ocfs2 calls dlm_lock() again on the same
resource.  This dlm_lock() returns -EBUSY because the previous request has
not completed, either normally or by cancellation.  This is expected.
Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
different nodes?




A couple options to try: wait for the original request to complete
(normally or by cancellation) before calling dlm_lock() again, or retry
dlm_lock() on -EBUSY.
If I retry dlm_lock() repeatedly, I wonder whether this will lead to a
kernel soft lockup or waste a lot of CPU.

If dlm_lock() returns -EAGAIN, how should we handle that case?
Retry repeatedly as well?
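
For what it is worth, a retry with a short sleep between attempts would avoid
spinning the CPU. A minimal in-kernel sketch (a hypothetical helper, assuming
the dlm_lock() API from linux/dlm.h and a conversion request; the flags, retry
bound and sleep time are illustrative only, not a recommendation from the dlm
maintainers):

#include <linux/delay.h>
#include <linux/dlm.h>
#include <linux/errno.h>

/* hypothetical helper: retry a downconvert while the previous request
 * (or its cancellation) has not completed yet */
static int downconvert_retry(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
			     int new_mode, void *astarg,
			     void (*ast)(void *),
			     void (*bast)(void *, int))
{
	int ret, tries = 0;

	do {
		ret = dlm_lock(ls, new_mode, lksb,
			       DLM_LKF_CONVERT | DLM_LKF_VALBLK,
			       NULL, 0, 0, ast, astarg, bast);
		if (ret != -EBUSY && ret != -EAGAIN)
			break;
		msleep(20);	/* back off instead of busy-looping */
	} while (++tries < 500);

	return ret;
}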

Thanks
Gang



Dave





Re: [Cluster-devel] Why does dlm_lock function fails when downconvert a dlm lock?

2021-08-12 Thread Gang He

Hi Alexander,


On 2021/8/12 4:35, Alexander Aring wrote:

Hi,

On Wed, Aug 11, 2021 at 6:41 AM Gang He  wrote:


Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same problem).
When node A had acquired a dlm (EX) lock and node B tried to get the same lock,
node A got a BAST message; node A then downconverted the dlm lock to NL, and the
dlm_lock function failed with error -16.
The failure does not always happen, but in some cases I can hit it.
Why does the dlm_lock function fail when downconverting a dlm lock? Are there
any documents describing these error cases?
If the code ignores the dlm_lock error returned on node A, node B will never get
the dlm lock.
How should we handle such a situation? Call the dlm_lock function again to
downconvert the dlm lock?


What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?

ocfs2 file system.



I believe you are running into case [0]. Can you provide the
corresponding log_debug() message? You need to insert
"log_debug=1" into your dlm.conf; the message will then be reported
at the KERN_DEBUG level in your kernel log.
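
For reference, a minimal /etc/dlm/dlm.conf carrying that option could look like
this (hypothetical example; keep whatever other settings you already have):

# /etc/dlm/dlm.conf
log_debug=1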
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: 
remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: 
addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: 
remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: 
addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: 
validate_lock_args -16 10 10 10c 2 0 M046e02
[Thu Aug 12 12:05:05 2021] 
(ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error 
-16 while calling ocfs2_dlm_lock on resource M046e02
[Thu Aug 12 12:05:05 2021] 
(ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] 
(ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: 
remwait 10 cancel_reply overlap


The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The other two node kernel log:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem reliably.
I want to know whether this error is expected, since there is no extreme
pressure test involved.
Second, how should we handle these error cases? Call the dlm_lock function
again? The function may fail again, which could lead to a kernel soft lockup
after multiple retries.


Thanks
Gang



Thanks.

- Alex

[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886





[Cluster-devel] Why does dlm_lock function fails when downconvert a dlm lock?

2021-08-11 Thread Gang He
Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same problem).
When node A had acquired a dlm (EX) lock and node B tried to get the same lock,
node A got a BAST message; node A then downconverted the dlm lock to NL, and the
dlm_lock function failed with error -16.
The failure does not always happen, but in some cases I can hit it.
Why does the dlm_lock function fail when downconverting a dlm lock? Are there
any documents describing these error cases?
If the code ignores the dlm_lock error returned on node A, node B will never get
the dlm lock.
How should we handle such a situation? Call the dlm_lock function again to
downconvert the dlm lock?

Thanks
Gang

 




Re: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?

2019-10-11 Thread Gang He
Hello Hayes,

> -Original Message-
> From: cluster-devel-boun...@redhat.com
> [mailto:cluster-devel-boun...@redhat.com] On Behalf Of Hayes, Bill
> Sent: October 11, 2019 0:42
> To: ocfs2-de...@oss.oracle.com; cluster-devel@redhat.com
> Cc: Rocky (The good-looking one) Craig 
> Subject: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?
> 
> We have been experimenting with distributed file systems across multiple
> Linux instances connected to a shared block device.  In our setup, the "disk" 
> is
> not a legacy SAN or iSCSI.  Instead it is a shared memory-semantic fabric
> that is being presented as a Linux block device.
> 
> We have been working with both GFS2 and OCFS2 to evaluate the suitability
> to work on our shared memory configuration.  Right now we have gotten
> both GFS2 and OCFS2 to work with block driver but each file system still does
> block copies.  Our goal is to extend mmap() of the file system(s) to allow 
> true
> zero-copy load/store access directly to the memory fabric.  We believe
> adding DAX support into the OCFS2 and/or GFS2 is an expedient path to use a
> block device that fronts our memory fabric with DAX.
> 
> Based on the HW that OCFS2 and GFS2 were built for (iSCSI, FC, DRDB, etc)
> there probably has been no reason to implement DAX to date.  The advent of
> various memory semantic fabrics (Gen-Z, NUMAlink, etc) is driving our
> interest in extending OCFS2 and/or GFS2 to take advantage of DAX.  We
> have two platforms set up, one based on actual hardware and another based
> on VMs and are eager to begin deeper work.
> 
> Has there been any discussion or interest in DAX support in OCFS2?
No, but I think this is a very interesting topic/feature.
I hope we can put some effort into investigating how to make OCFS2 support DAX,
since some local file systems have supported this feature for a long time.

> Is there interest from the OCFS2 development community to see DAX support
> developed and put upstream?
From my personal view, it is very attractive.
But we should also be aware that cluster file systems are usually based on DLM,
and the DLM instances communicate with each other via the network.
That means network latency has to be considered.

Thanks
Gang

> 
> Has there been any discussion or interest in DAX support in GFS2?
> Is there interest from the GFS2 development community to see DAX support
> developed and put upstream?
> 
> Regards,
> Bill
> 




[Cluster-devel] [PATCH] dlm: remove O_NONBLOCK flag in sctp_connect_to_sock

2018-05-28 Thread Gang He
We should remove the O_NONBLOCK flag when calling sock->ops->connect()
in the sctp_connect_to_sock() function.
Why?
1. Up to now, the sctp socket connect() function ignores the flags argument,
which means the O_NONBLOCK flag has no effect, so we should remove it to
avoid confusion (this part is not urgent).
2. In the future there will be a patch that fixes this, after which the flags
argument will take effect; that patch has been queued at
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/sctp?id=644fbdeacf1d3edd366e44b8ba214de9d1dd66a9
At that point the O_NONBLOCK flag will make sock->ops->connect() return
immediately without waiting, so the connection is never established and the
DLM kernel module calls sock->ops->connect() again and again. The bad results
are that CPU usage is almost 100%, and a soft lockup can even be triggered if
the related configuration options are enabled.
The DLM kernel module also prints lots of messages like,
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
The upper application (e.g. the ocfs2 mount command) hangs in new_lockspace();
the whole backtrace is as below,
tb0307-nd2:~ # cat /proc/2935/stack
[<0>] new_lockspace+0x957/0xac0 [dlm]
[<0>] dlm_new_lockspace+0xae/0x140 [dlm]
[<0>] user_cluster_connect+0xc3/0x3a0 [ocfs2_stack_user]
[<0>] ocfs2_cluster_connect+0x144/0x220 [ocfs2_stackglue]
[<0>] ocfs2_dlm_init+0x215/0x440 [ocfs2]
[<0>] ocfs2_fill_super+0xcb0/0x1290 [ocfs2]
[<0>] mount_bdev+0x173/0x1b0
[<0>] mount_fs+0x35/0x150
[<0>] vfs_kern_mount.part.23+0x54/0x100
[<0>] do_mount+0x59a/0xc40
[<0>] SyS_mount+0x80/0xd0
[<0>] do_syscall_64+0x76/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[<0>] 0x

So I think we should remove the O_NONBLOCK flag here, since the DLM kernel
module cannot handle a non-blocking socket connect() properly.

Signed-off-by: Gang He 
---
 fs/dlm/lowcomms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index d31e9abfb9f1..a5e4a221435c 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1092,7 +1092,7 @@ static void sctp_connect_to_sock(struct connection *con)
 	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
 			  sizeof(tv));
 	result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
-				    O_NONBLOCK);
+				    0);
 	memset(&tv, 0, sizeof(tv));
 	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
 			  sizeof(tv));
-- 
2.12.3



[Cluster-devel] [PATCH] dlm: fix a clerical error when set SCTP_NODELAY

2018-05-01 Thread Gang He
There is a clerical error in turning off Nagle's algorithm in the
sctp_connect_to_sock() function, which means the attempt to turn off
Nagle's algorithm fails.
After this correction, DLM performance is noticeably improved
when using the SCTP protocol.

Signed-off-by: Gang He <g...@suse.com>
Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 fs/dlm/lowcomms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 5243989..8151252 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1080,7 +1080,7 @@ static void sctp_connect_to_sock(struct connection *con)
log_print("connecting to %d", con->nodeid);
 
/* Turn off Nagle's algorithm */
-   kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *),
+   kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *),
  sizeof(one));
 
result = sock->ops->connect(sock, (struct sockaddr *), addr_len,
-- 
1.8.5.6



[Cluster-devel] [PATCH] dlm: make sctp_connect_to_sock() return in specified time

2018-04-26 Thread Gang He
When the user sets up a two-ring cluster, the DLM kernel module
automatically selects the SCTP protocol to communicate between
the nodes. There will be an approximately 5 minute hang in the DLM
kernel module if one ring is broken before switching to the
other ring, and this can affect the dependent upper
applications, e.g. ocfs2, gfs2, clvm and clustered MD.
Unfortunately, with a two-ring cluster we cannot explicitly
select the TCP protocol for DLM communication, since the DLM
kernel module only supports the SCTP protocol for a multi-ring
cluster.
Based on my investigation, the time is spent in the sock->ops->connect()
function before it returns the ETIMEDOUT (-110) error, since the
O_NONBLOCK argument to connect() does not work here; therefore we
should make sock->ops->connect() return within a specified time by
setting the socket's SO_SNDTIMEO attribute.

Signed-off-by: Gang He <g...@suse.com>
---
 fs/dlm/lowcomms.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 5243989..b786acc 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1037,6 +1037,7 @@ static void sctp_connect_to_sock(struct connection *con)
int result;
int addr_len;
struct socket *sock;
+   struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
 
if (con->nodeid == 0) {
log_print("attempt to connect sock 0 foiled");
@@ -1083,8 +1084,19 @@ static void sctp_connect_to_sock(struct connection *con)
 	kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *)&one,
 			  sizeof(one));
 
+	/*
+	 * Make sock->ops->connect() function return in specified time,
+	 * since O_NONBLOCK argument in connect() function does not work here,
+	 * then, we should restore the default value of this attribute.
+	 */
+	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
+			  sizeof(tv));
 	result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
 				    O_NONBLOCK);
+	memset(&tv, 0, sizeof(tv));
+	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
+			  sizeof(tv));
+
 	if (result == -EINPROGRESS)
 		result = 0;
 	if (result == 0)
-- 
1.8.5.6



Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental

2018-04-09 Thread Gang He
Hi Steven and David,


>>> 
> Hi,
> 
> 
> On 09/04/18 06:02, Gang He wrote:
>> Hello David,
>>
>> If the user sets "protocol=tcp" in the configuration file /etc/dlm/dlm.conf 
> under two-rings cluster environment,
>> DLM kernel module will not work with the below error message,
>> [   43.696924] DLM installed
>> [  149.552039] ocfs2: Registered cluster interface user
>> [  149.559579] dlm: TCP protocol can't handle multi-homed hosts, try SCTP  
> <<== here, failed
>> [  149.559589] dlm: cannot start dlm lowcomms -22
>> [  149.559612] (mount.ocfs2,2593,3):ocfs2_dlm_init:3120 ERROR: status = -22
>> [  149.559629] (mount.ocfs2,2593,3):ocfs2_mount_volume:1845 ERROR: status = 
> -22
>>
>> Then, could we modify the code to let this case still work by using only one
>> ring address, or is the code written this way on purpose?
>> in lowcomms.c
>> 1358 static int tcp_listen_for_all(void)
>> 1359 {
>> 1360 struct socket *sock = NULL;
>> 1361 struct connection *con = nodeid2con(0, GFP_NOFS);
>> 1362 int result = -EINVAL;
>> 1363
>> 1364 if (!con)
>> 1365 return -ENOMEM;
>> 1366
>> 1367 /* We don't support multi-homed hosts */
>> 1368 if (dlm_local_addr[1] != NULL) {   <<== here, could we get rid
>> of this limitation?
>> 1369 log_print("TCP protocol can't handle multi-homed hosts, 
> "
>> 1370   "try SCTP");
>> 1371 return -EINVAL;
>> 1372 }
>> 1373
>> 1374 log_print("Using TCP for communications");
>> 1375
>> 1376 sock = tcp_create_listen_sock(con, dlm_local_addr[0]);
>> 1377 if (sock) {
>> 1378 add_sock(sock, con);
>> 1379 result = 0;
>> 1380 }
>> 1381 else {
>> 1382 result = -EADDRINUSE;
>> 1383 }
>>
>>
>> Thanks
>> Gang
>>
> There is already a patch set to allow multi-homing for TCP. Mark and 
> Dave can comment on the current status and how far from merging it 
> currently is,
Thanks for your update on this problem; hopefully we will see the related
patches in the Linus git tree soon.

Thanks
Gang

> 
> Steve.




Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental

2018-04-08 Thread Gang He
Hello David,

If the user sets "protocol=tcp" in the configuration file /etc/dlm/dlm.conf
in a two-ring cluster environment,
the DLM kernel module will not work, and it prints the error messages below,
[   43.696924] DLM installed
[  149.552039] ocfs2: Registered cluster interface user
[  149.559579] dlm: TCP protocol can't handle multi-homed hosts, try SCTP  <<== 
here, failed
[  149.559589] dlm: cannot start dlm lowcomms -22
[  149.559612] (mount.ocfs2,2593,3):ocfs2_dlm_init:3120 ERROR: status = -22
[  149.559629] (mount.ocfs2,2593,3):ocfs2_mount_volume:1845 ERROR: status = -22 

Then, could we modify the code to let this case still work by using only one
ring address, or is the code written this way on purpose?
in lowcomms.c
1358 static int tcp_listen_for_all(void)
1359 {
1360 struct socket *sock = NULL;
1361 struct connection *con = nodeid2con(0, GFP_NOFS);
1362 int result = -EINVAL;
1363
1364 if (!con)
1365 return -ENOMEM;
1366
1367 /* We don't support multi-homed hosts */
1368 if (dlm_local_addr[1] != NULL) {   <<== here, could we get rid of
this limitation?
1369 log_print("TCP protocol can't handle multi-homed hosts, "
1370   "try SCTP");
1371 return -EINVAL;
1372 }
1373
1374 log_print("Using TCP for communications");
1375
1376 sock = tcp_create_listen_sock(con, dlm_local_addr[0]);
1377 if (sock) {
1378 add_sock(sock, con);
1379 result = 0;
1380 }
1381 else {
1382 result = -EADDRINUSE;
1383 }
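
To illustrate the question, a hypothetical change along the following lines
would fall back to the first configured address instead of refusing to start.
This is only a sketch of what is being asked, not the multi-homed TCP patch set
mentioned elsewhere in this thread:

	/* hypothetical: tolerate multi-homed hosts by listening on the
	 * first configured address only, instead of returning -EINVAL */
	if (dlm_local_addr[1] != NULL)
		log_print("TCP cannot use multiple addresses, "
			  "using the first ring address only");

	log_print("Using TCP for communications");

	sock = tcp_create_listen_sock(con, dlm_local_addr[0]);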


Thanks
Gang


>>> 
> On Mon, Apr 02, 2018 at 08:01:24PM -0600, Gang He wrote:
>> OK, I got your point.
>> But, could we have an appropriate way to let the users know the SCTP protocol
>> status?
> 
> I think this is a case where suse/rh/etc need to have their own
> distro-specific approaches for specifying the usage parameters that they
> have tested and found to be supportable.  Other companies have previously
> found their specific use of SCTP to be acceptable.  RH does not properly
> support dlm+SCTP for similar reasons as you've found, although I've more
> recently encouraged customers to try dlm+SCTP with a single path in order
> debug or diagnose potential networking issues.




Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental

2018-04-02 Thread Gang He
Hi David,



>>> 
> On Thu, Mar 22, 2018 at 10:27:56PM -0600, Gang He wrote:
>> Hello David,
>> 
>> Do you agree to add this prompt to the user? 
>> Since sometimes customers attempted to setup SCTP protocol with two rings, 
>> but they could not get the expected result, then it maybe bring some 
> concerns to the customer for DLM qualities.
> 
> I don't think the kernel message is a good way to communicate this to users.
> Dave
OK, I got your point.
But could we have an appropriate way to let the users know the SCTP protocol status?

Thanks
Gang

> 
> 
>> > As you know, DLM module can use TCP or SCTP protocols to
>> > communicate among the cluster.
>> > But, according to our testing, SCTP protocol is still considered
>> > experimental, since not all aspects are working correctly and
>> > it is not full tested.
>> > e.g. SCTP connection channel switch needs about 5mins hang in case
>> > one connection(ring) is broken.
>> > Then, I suggest to add a kernel print, which prompts the user SCTP
>> > protocol for DLM should be considered experimental, it is not
>> > recommended in production environment.
>> > 
>> > Signed-off-by: Gang He <g...@suse.com>
>> > ---
>> >  fs/dlm/lowcomms.c | 1 +
>> >  1 file changed, 1 insertion(+)
>> > 
>> > diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
>> > index cff79ea..18fd85d 100644
>> > --- a/fs/dlm/lowcomms.c
>> > +++ b/fs/dlm/lowcomms.c
>> > @@ -1307,6 +1307,7 @@ static int sctp_listen_for_all(void)
>> >return -ENOMEM;
>> >  
>> >log_print("Using SCTP for communications");
>> > +  log_print("SCTP protocol is experimental, use at your own risk");
>> >  
>> >result = sock_create_kern(_net, dlm_local_addr[0]->ss_family,
>> >  SOCK_STREAM, IPPROTO_SCTP, );
>> > -- 
>> > 1.8.5.6




Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental

2018-03-22 Thread Gang He
Hello David,

Do you agree to add this prompt for the user?
Sometimes customers attempted to set up the SCTP protocol with two rings but
could not get the expected result, and that may raise some concerns for the
customers about DLM quality.


Thanks
Gang


>>> 
> As you know, DLM module can use TCP or SCTP protocols to
> communicate among the cluster.
> But, according to our testing, the SCTP protocol is still considered
> experimental, since not all aspects are working correctly and
> it is not fully tested.
> e.g. an SCTP connection channel switch causes about a 5 minute hang when
> one connection (ring) is broken.
> Then, I suggest adding a kernel print which prompts the user that the SCTP
> protocol for DLM should be considered experimental and is not
> recommended in a production environment.
> 
> Signed-off-by: Gang He <g...@suse.com>
> ---
>  fs/dlm/lowcomms.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
> index cff79ea..18fd85d 100644
> --- a/fs/dlm/lowcomms.c
> +++ b/fs/dlm/lowcomms.c
> @@ -1307,6 +1307,7 @@ static int sctp_listen_for_all(void)
>   return -ENOMEM;
>  
>   log_print("Using SCTP for communications");
> + log_print("SCTP protocol is experimental, use at your own risk");
>  
>   result = sock_create_kern(_net, dlm_local_addr[0]->ss_family,
> SOCK_STREAM, IPPROTO_SCTP, );
> -- 
> 1.8.5.6




Re: [Cluster-devel] [ClusterLabs] DLM connection channel switch take too long time (> 5mins)

2018-03-08 Thread Gang He
Hi Feldhost,

I used active rrp_mode in corosync.conf and rebooted the cluster to make the
configuration take effect.
But the roughly 5 minute hang in the new_lockspace() function is still there.

Thanks
Gang
 

>>> 
> Hi, so try to use active mode.
> 
> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_installatio 
> n_terms.html
> 
> That fixes I saw in 4.14.*
> 
>> On 8 Mar 2018, at 09:12, Gang He <g...@suse.com> wrote:
>> 
>> Hi Feldhost,
>> 
>> 
>>>>> 
>>> Hello Gang He,
>>> 
>>> which type of corosync rrp_mode you use? Passive or Active? 
>> clvm1:/etc/corosync # cat corosync.conf  | grep rrp_mode
>>rrp_mode:   passive
>> 
>> Did you try test both?
>> No, only this mode. 
>> Also, what kernel version you use? I see some SCTP fixes in latest kernels.
>> clvm1:/etc/corosync # uname -r
>> 4.4.114-94.11-default
>> It looks that sock->ops->connect() function is blocked for too long time 
>> before 
> return, under broken network situation. 
>> In normal network, sock->ops->connect() function returns very quickly.
>> 
>> Thanks
>> Gang
>> 
>>> 
>>>> On 8 Mar 2018, at 08:52, Gang He <g...@suse.com> wrote:
>>>> 
>>>> Hello list and David Teigland,
>>>> 
>>>> I got a problem under a two rings cluster, the problem can be reproduced 
>>> with the below steps.
>>>> 1) setup a two rings cluster with two nodes.
>>>> e.g. 
>>>> clvm1(nodeid 172204569)  addr_list eth0 10.67.162.25 eth1 192.168.152.240
>>>> clvm2(nodeid 172204570)  addr_list eth0 10.67.162.26 eth1 192.168.152.103
>>>> 
>>>> 2) the whole cluster works well, then I put eth0 down on node clvm2, and 
>>> restart pacemaker service on that node.
>>>> ifconfig eth0 down
>>>> rcpacemaker restart
>>>> 
>>>> 3) the whole cluster still work well (that means corosync is very smooth 
>>>> to 
>>> switch to the other ring).
>>>> Then, I can mount ocfs2 file system on node clvm2 quickly with the command 
>>>> mount /dev/sda /mnt/ocfs2 
>>>> 
>>>> 4) Next, I do the same mount on node clvm1, the mount command will be 
>>>> hanged 
> 
>>> for about 5 mins, and finally the mount command is done.
>>>> But, if we setup a ocfs2 file system resource in pacemaker,
>>>> the pacemaker resource agent will consider ocfs2 file system resource 
>>> startup failure before this command returns,
>>>> the pacemaker will fence node clvm1. 
>>>> This problem is impacting our customer's estimate, since they think the 
>>>> two 
>>> rings can be switched smoothly.
>>>> 
>>>> According to this problem, I can see the mount command is hanged with the 
>>> below back trace,
>>>> clvm1:/ # cat /proc/6688/stack
>>>> [] new_lockspace+0x92d/0xa70 [dlm]
>>>> [] dlm_new_lockspace+0x69/0x160 [dlm]
>>>> [] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
>>>> [] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
>>>> [] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>>> [] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>>> [] mount_bdev+0x1a0/0x1e0
>>>> [] mount_fs+0x3a/0x170
>>>> [] vfs_kern_mount+0x62/0x110
>>>> [] do_mount+0x213/0xcd0
>>>> [] SyS_mount+0x85/0xd0
>>>> [] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>> [] 0x
>>>> 
>>>> The root cause is in sctp_connect_to_sock() function in lowcomms.c,
>>>> 1075
>>>> 1076 log_print("connecting to %d", con->nodeid);
>>>> 1077
>>>> 1078 /* Turn off Nagle's algorithm */
>>>> 1079 kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *)&one,
>>>> 1080   sizeof(one));
>>>> 1081
>>>> 1082 result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
>>>> 1083O_NONBLOCK);  <<= here, this invoking
>>>> will cost > 5 mins before return ETIMEDOUT(-110).
>>>> 1084 printk(KERN_ERR "sctp_connect_to_sock connect: %d\n", result);
>>>> 1085
>>>> 1086 if (result == -EINPROGRESS)
>>>> 1087 result = 0;
>>>> 1088 if (result == 0)
>>>> 1089 goto out;
>>>> 
>>>> Then, I want to know if this problem was found/fixed before? 
>>>> it looks DLM can not switch the second ring very quickly, this will impact 
>>> the above application (e.g. CLVM, ocfs2) to create a new lock space before 
>>> it's startup.
>>>> 
>>>> Thanks
>>>> Gang
>>>> 
>>>> 
>>>> ___
>>>> Users mailing list: us...@clusterlabs.org 
>>>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>>> 
>>>> Project Home: http://www.clusterlabs.org 
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>>> Bugs: http://bugs.clusterlabs.org



Re: [Cluster-devel] [ClusterLabs] DLM connection channel switch take too long time (> 5mins)

2018-03-08 Thread Gang He
Hi Feldhost,


>>> 
> Hello Gang He,
> 
> which type of corosync rrp_mode you use? Passive or Active? 
clvm1:/etc/corosync # cat corosync.conf  | grep rrp_mode
rrp_mode:   passive

Did you try test both?
No, only this mode. 
Also, what kernel version you use? I see some SCTP fixes in latest kernels.
clvm1:/etc/corosync # uname -r
4.4.114-94.11-default
It looks like the sock->ops->connect() function is blocked for too long before
returning when the network is broken.
On a normal network, sock->ops->connect() returns very quickly.

Thanks
Gang

> 
>> On 8 Mar 2018, at 08:52, Gang He <g...@suse.com> wrote:
>> 
>> Hello list and David Teigland,
>> 
>> I got a problem under a two rings cluster, the problem can be reproduced 
> with the below steps.
>> 1) setup a two rings cluster with two nodes.
>> e.g. 
>> clvm1(nodeid 172204569)  addr_list eth0 10.67.162.25 eth1 192.168.152.240
>> clvm2(nodeid 172204570)  addr_list eth0 10.67.162.26 eth1 192.168.152.103
>> 
>> 2) the whole cluster works well, then I put eth0 down on node clvm2, and 
> restart pacemaker service on that node.
>> ifconfig eth0 down
>> rcpacemaker restart
>> 
>> 3) the whole cluster still work well (that means corosync is very smooth to 
> switch to the other ring).
>> Then, I can mount ocfs2 file system on node clvm2 quickly with the command 
>> mount /dev/sda /mnt/ocfs2 
>> 
>> 4) Next, I do the same mount on node clvm1, the mount command will be hanged 
> for about 5 mins, and finally the mount command is done.
>> But, if we setup a ocfs2 file system resource in pacemaker,
>> the pacemaker resource agent will consider ocfs2 file system resource 
> startup failure before this command returns,
>> the pacemaker will fence node clvm1. 
>> This problem is impacting our customer's estimate, since they think the two 
> rings can be switched smoothly.
>> 
>> According to this problem, I can see the mount command is hanged with the 
> below back trace,
>> clvm1:/ # cat /proc/6688/stack
>> [] new_lockspace+0x92d/0xa70 [dlm]
>> [] dlm_new_lockspace+0x69/0x160 [dlm]
>> [] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
>> [] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
>> [] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>> [] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>> [] mount_bdev+0x1a0/0x1e0
>> [] mount_fs+0x3a/0x170
>> [] vfs_kern_mount+0x62/0x110
>> [] do_mount+0x213/0xcd0
>> [] SyS_mount+0x85/0xd0
>> [] entry_SYSCALL_64_fastpath+0x1e/0xb6
>> [] 0x
>> 
>> The root cause is in sctp_connect_to_sock() function in lowcomms.c,
>> 1075
>> 1076 log_print("connecting to %d", con->nodeid);
>> 1077
>> 1078 /* Turn off Nagle's algorithm */
>> 1079 kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *)&one,
>> 1080   sizeof(one));
>> 1081
>> 1082 result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
>> 1083O_NONBLOCK);  <<= here, this invoking
>> will cost > 5 mins before return ETIMEDOUT(-110).
>> 1084 printk(KERN_ERR "sctp_connect_to_sock connect: %d\n", result);
>> 1085
>> 1086 if (result == -EINPROGRESS)
>> 1087 result = 0;
>> 1088 if (result == 0)
>> 1089 goto out;
>> 
>> Then, I want to know if this problem was found/fixed before? 
>> it looks DLM can not switch the second ring very quickly, this will impact 
> the above application (e.g. CLVM, ocfs2) to create a new lock space before 
> it's startup.
>> 
>> Thanks
>> Gang
>> 
>> 
>> ___
>> Users mailing list: us...@clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org



[Cluster-devel] [PATCH v2] dlm: Make dismatch error message more clear

2017-05-17 Thread Gang He
This change tries to make this error message clearer. The upper
applications (e.g. ocfs2) invoke dlm_new_lockspace to create a new
lockspace, passing a cluster name. Sometimes dlm_new_lockspace
returns failure because the two cluster names do not match, and the
user is a little confused since this error message is not obvious
enough.

Signed-off-by: Gang He <g...@suse.com>
---
 fs/dlm/lockspace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index 91592b7..b03d808 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -455,7 +455,8 @@ static int new_lockspace(const char *name, const char 
*cluster,
 
if (dlm_config.ci_recover_callbacks && cluster &&
strncmp(cluster, dlm_config.ci_cluster_name, DLM_LOCKSPACE_LEN)) {
-   log_print("dlm cluster name %s mismatch %s",
+   log_print("dlm cluster name '%s' does not match "
+ "the application cluster name '%s'",
  dlm_config.ci_cluster_name, cluster);
error = -EBADR;
goto out;
-- 
1.8.5.6



[Cluster-devel] [PATCH] dlm: Make dismatch error message more clear

2017-05-16 Thread Gang He
This change tries to make this error message clearer. The upper
applications (e.g. ocfs2) invoke dlm_new_lockspace to create a new
lockspace, passing a cluster name. Sometimes dlm_new_lockspace
returns failure because the two cluster names do not match, and the
user is a little confused since this error message is not obvious
enough.

Signed-off-by: Gang He <g...@suse.com>
---
 fs/dlm/lockspace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index 91592b7..b03d808 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -455,7 +455,8 @@ static int new_lockspace(const char *name, const char 
*cluster,
 
if (dlm_config.ci_recover_callbacks && cluster &&
strncmp(cluster, dlm_config.ci_cluster_name, DLM_LOCKSPACE_LEN)) {
-   log_print("dlm cluster name %s mismatch %s",
+   log_print("dlm configured cluster name '%s' does not match "
+ "the passed cluster name '%s'",
  dlm_config.ci_cluster_name, cluster);
error = -EBADR;
goto out;
-- 
1.8.5.6



Re: [Cluster-devel] GFS2 file system does not invalidate page cache after direct IO write

2017-05-04 Thread Gang He
Hello Andreas,


>>> 
> Gang,
> 
> On Thu, May 4, 2017 at 5:33 AM, Gang He <g...@suse.com> wrote:
>> Hello Guys,
>>
>> I found a interesting thing on GFS2 file system, After I did a direct IO 
> write for a whole file, I still saw there were some page caches in this 
> inode.
>> It looks this GFS2 behavior does not follow file system POSIX semantics, I 
> just want to know this problem belongs to a know issue or we can fix it?
>> By the way, I did the same testing on EXT4 and OCFS2 file systems, the 
> result looks OK.
>> I will paste my testing command lines and outputs as below,
>>
>> For EXT4 file system,
>> tb-nd1:/mnt/ext4 # rm -rf f3
>> tb-nd1:/mnt/ext4 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
>> 4+0 records in
>> 4+0 records out
>> 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0393563 s, 107 MB/s
>> tb-nd1:/mnt/ext4 # vmtouch -v f3
>> f3
>> [ ] 0/1024
>>
>>Files: 1
>>  Directories: 0
>>   Resident Pages: 0/1024  0/4M  0%
>>  Elapsed: 0.000424 seconds
>> tb-nd1:/mnt/ext4 #
>>
>> For OCFS2 file system,
>> tb-nd1:/mnt/ocfs2 # rm -rf f3
>> tb-nd1:/mnt/ocfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
>> 4+0 records in
>> 4+0 records out
>> 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0592058 s, 70.8 MB/s
>> tb-nd1:/mnt/ocfs2 # vmtouch -v f3
>> f3
>> [ ] 0/1024
>>
>>Files: 1
>>  Directories: 0
>>   Resident Pages: 0/1024  0/4M  0%
>>  Elapsed: 0.000226 seconds
>>
>> For GFS2 file system,
>> tb-nd1:/mnt/gfs2 # rm -rf f3
>> tb-nd1:/mnt/gfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
>> 4+0 records in
>> 4+0 records out
>> 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0579509 s, 72.4 MB/s
>> tb-nd1:/mnt/gfs2 # vmtouch -v f3
>> f3
>> [ oo oOo  ] 48/1024
> 
> I cannot reproduce, at least not so easily. What kernel version is
> this? If it's not a mainline kernel, can you reproduce on mainline?
I can always reproduce it. I am using kernel version 4.11.0-rc4-2-default;
although that is not the latest version, it is new enough.
By the way, I added some printk calls to the GFS2 and OCFS2 kernel modules, and
I found that GFS2 direct I/O always falls back to buffered I/O; I am not sure
whether this behavior is by design.
Of course, even when GFS2 falls back to buffered I/O, the code should still make
sure the related page cache is invalidated, but the test result is not as
expected, so I need to look at the code more deeply.
The printk output looks like,
[  198.176774] gfs2_file_write_iter: enter ino 132419 0 - 1048576
[  198.176785] gfs2_direct_IO: enter ino 132419 pages 0 0 - 1048576
[  198.176787] gfs2_direct_IO: exit ino 132419 - (0)   <<== here,
gfs2_direct_IO always returns 0 and then falls back to buffered IO; is this
behavior by design?
[  198.184640] gfs2_file_write_iter: exit ino 132419 - (1048576) <<== The 
write_iter looks to return the right bytes.
[  198.189151] gfs2_file_write_iter: enter ino 132419 1048576 - 1048576
[  198.189163] gfs2_direct_IO: enter ino 132419 pages 8 1048576 - 1048576 <<== 
here, the inode's page number is greater than zero.
[  198.189165] gfs2_direct_IO: exit ino 132419 - (0)
[  198.195901] gfs2_file_write_iter: exit ino 132419 - (1048576)
But for OCFS2
[  120.331053] ocfs2_file_write_iter: enter ino 297475 0 - 1048576
[  120.331065] ocfs2_direct_IO: enter ino 297475 pages 0 0 - 1048576
[  120.343129] ocfs2_direct_IO: exit ino 297475 (1048576) <<== here, 
ocfs2_direct_IO can return the right bytes.
[  120.343132] ocfs2_file_write_iter: exit ino 297475 - (1048576)
[  120.347705] ocfs2_file_write_iter: enter ino 297475 1048576 - 1048576
[  120.347713] ocfs2_direct_IO: enter ino 297475 pages 0 1048576 - 1048576  
<<== here, the inode's page number is always zero.
[  120.354096] ocfs2_direct_IO: exit ino 297475 (1048576)
[  120.354099] ocfs2_file_write_iter: exit ino 297475 - (1048576)

Thanks
Gang
 

> 
> Thanks,
> Andreas




[Cluster-devel] GFS2 file system does not invalidate page cache after direct IO write

2017-05-03 Thread Gang He
Hello Guys,

I found an interesting thing on the GFS2 file system: after I did a direct I/O
write covering a whole file, I still saw some page cache pages for that inode.
It looks like this GFS2 behavior does not follow file system POSIX semantics; I
just want to know whether this is a known issue or something we can fix.
By the way, I did the same testing on the EXT4 and OCFS2 file systems, and the
results look OK.
I will paste my testing command lines and outputs below,

For EXT4 file system,
tb-nd1:/mnt/ext4 # rm -rf f3
tb-nd1:/mnt/ext4 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0393563 s, 107 MB/s
tb-nd1:/mnt/ext4 # vmtouch -v f3
f3
[ ] 0/1024

   Files: 1
 Directories: 0
  Resident Pages: 0/1024  0/4M  0%
 Elapsed: 0.000424 seconds
tb-nd1:/mnt/ext4 #

For OCFS2 file system,
tb-nd1:/mnt/ocfs2 # rm -rf f3
tb-nd1:/mnt/ocfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0592058 s, 70.8 MB/s
tb-nd1:/mnt/ocfs2 # vmtouch -v f3
f3
[ ] 0/1024

   Files: 1
 Directories: 0
  Resident Pages: 0/1024  0/4M  0%
 Elapsed: 0.000226 seconds

For GFS2 file system,
tb-nd1:/mnt/gfs2 # rm -rf f3
tb-nd1:/mnt/gfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0579509 s, 72.4 MB/s
tb-nd1:/mnt/gfs2 # vmtouch -v f3
f3
[ oo oOo  ] 48/1024

   Files: 1
 Directories: 0
  Resident Pages: 48/1024  192K/4M  4.69%
 Elapsed: 0.000287 seconds


For the vmtouch tool, you can download its source code from
https://github.com/hoytech/vmtouch
I also printed (via printk) the inode's address_space state in kernel space
after a full-file direct I/O write; the nrpages value in the inode's
address_space is always greater than zero.
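
For reference, the kernel-side check was nothing more than a print of
mapping->nrpages; a hypothetical version of such a debug helper looks like this
(names are illustrative, not the exact patch I used):

#include <linux/fs.h>
#include <linux/printk.h>

/* hypothetical debug helper: report how many page-cache pages an inode
 * still holds, e.g. right after a direct I/O write completes */
static void dump_nrpages(struct inode *inode, const char *tag)
{
	pr_info("%s: ino %lu nrpages %lu\n",
		tag, inode->i_ino, inode->i_mapping->nrpages);
}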

Thanks
Gang





[Cluster-devel] inconsistent dlm_new_lockspace LVB_LEN size from ocfs2 user-space tool and ocfs2 kernel module

2016-05-13 Thread Gang He
Hello Guys,

Here is an inconsistent LVB_LEN size problem when creating a new lockspace from
a user-space tool (e.g. fsck.ocfs2) versus from a kernel module (e.g.
ocfs2/stack_user.c).
From the user-space tool, the LVB size is DLM_USER_LVB_LEN (32 bytes, defined
in include/linux/dlm_device.h).
From the kernel module, the LVB size is DLM_LVB_LEN (64 bytes).
Why was it designed like this? Looking at the GFS2 kernel module code, it uses
32 bytes as the LVB_LEN size, the same size as the DLM_USER_LVB_LEN macro
definition.
Now we have encountered a customer issue: the user ran fsck on an ocfs2 file
system from one node, but it aborted without releasing its lockspace (32
bytes); then the user mounted this file system.
The kernel module used the existing lockspace instead of creating a new
lockspace with a 64-byte LVB_LEN.
As a result, the user could no longer mount this file system from the other
nodes.
The error messages look like,
Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.430947] dlm: 
032F55597DEA4A61AB065568F964174D: config mismatch: 64,0 nodeid 177127961: 32,0
Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.433267] 
(mount.ocfs2,26981,46):ocfs2_dlm_init:2995 ERROR: status = -71
Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.433325] 
(mount.ocfs2,26981,46):ocfs2_mount_volume:1881 ERROR: status = -71
Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.433376] 
(mount.ocfs2,26981,46):ocfs2_fill_super:1236 ERROR: status = -71
Apr 26 16:29:16 mapkhpch1bl02 Filesystem(MITC_Pool1)[26912]: ERROR: Couldn't 
mount filesystem /dev/disk/by-id/scsi-3600507640081010d5082 on 
/MITC_Pool1

Of course, the immediate fix is easy: we can reboot all the nodes and then mount
the file system again.
But I want to know whether there were reasons for this design; otherwise, I
would like to see whether we can use the same size in user space and in the
kernel module.
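
To make the size mismatch concrete, the kernel mount path joins the lockspace
asking for a 64-byte LVB, roughly as in the sketch below (a simplified
illustration of the ocfs2 stack_user path, assuming the in-kernel
dlm_new_lockspace() API; flags and error handling are reduced), while the
user-space tool creates the same-named lockspace with DLM_USER_LVB_LEN (32
bytes):

#include <linux/dlm.h>

#define EXAMPLE_LVB_LEN 64	/* what the ocfs2 kernel stack asks for */

static dlm_lockspace_t *example_ls;

/* simplified illustration: if fsck.ocfs2 already created this lockspace
 * with a 32-byte LVB, the sizes disagree and other nodes later fail with
 * "config mismatch: 64,0 ... 32,0" as in the log above */
static int example_join(const char *fsname, const char *cluster)
{
	return dlm_new_lockspace(fsname, cluster, DLM_LSFL_FS,
				 EXAMPLE_LVB_LEN, NULL, NULL, NULL,
				 &example_ls);
}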


Thanks
Gang