Re: [Cluster-devel] FS/DLM module triggered kernel BUG
On 2021/8/23 21:49, Alexander Aring wrote:
> Hi Gang He,
>
> On Mon, Aug 23, 2021 at 1:43 AM Gang He wrote:
>> Hello Guys,
>> I used kernel 5.13.8, and I sometimes encountered a kernel BUG
>> triggered by the dlm module.
>
> What do you do exactly? I would like to test it on a recent upstream
> version, or can you do it?

I am not specifically testing the dlm kernel module. I am doing
ocfs2-related testing with openSUSE Tumbleweed, which includes a very new
kernel version. But sometimes the ocfs2 test cases were blocked/aborted due
to this DLM problem. Since the dlm kernel module is not the latest source
code, I am not sure whether this problem is already fixed.

> could be, see below.
>
>> The backtrace is as below,
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
>> [Fri Aug 20 16:24:14 2021] dlm: connection 5ef82293 got EOF from 172204615
>
> here we disconnect from nodeid 172204615.
>
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
>> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: dd99ffd16650
>> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
>> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
>> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
>> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
>> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G W 5.13.8-1-default #1 openSUSE Tumbleweed
>> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
>> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
>> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
>> [...]
>> [Fri Aug 20 16:24:14 2021] Call Trace:
>> [Fri Aug 20 16:24:14 2021] dlm_receive_buffer+0x66/0x150 [dlm]
>
> It would be interesting if we got here some message from nodeid
> 172204615, and I think this is what happens. There is maybe some
> use-after-free going on, and we should not receive any more messages
> from nodeid 172204615.
>
> I recently added some dlm tracing infrastructure. It should be simple to
> add a trace event here, print out the nodeid and compare timestamps.
>
> I recently fixed a synchronization issue which is not part of kernel
> 5.13.8 and has something to do with what you are seeing here.
>
> There exists a workaround, or a simple test to check whether this really
> affects you: simply create a dummy lockspace on all nodes, so we never
> actually do any disconnects, and see whether you run into this issue
> again.

What is this git commit? I do not want to see any kernel (warning) print
from the DLM kernel module; sometimes DLM enters a stuck state after such a
print. Since there were a few commits in the past weeks, I just wonder
whether there is any regression.

Thanks
Gang

>> [Fri Aug 20 16:24:14 2021] dlm_process_incoming_buffer+0x38/0x90 [dlm]
>> [Fri Aug 20 16:24:14 2021] receive_from_sock+0xd4/0x1f0 [dlm]
>> [Fri Aug 20 16:24:14 2021] process_recv_sockets+0x1a/0x20 [dlm]
>> [Fri Aug 20 16:24:14 2021] process_one_work+0x1df/0x370
>> [Fri Aug 20 16:24:14 2021] worker_thread+0x50/0x400
>> [Fri Aug 20 16:24:14 2021] ? process_on
[Cluster-devel] FS/DLM module triggered kernel BUG
Hello Guys,

I used kernel 5.13.8, and I sometimes encountered a kernel BUG triggered by
the dlm module. Since the dlm kernel module is not the latest source code, I
am not sure whether this problem is already fixed. The backtrace is as
below,

[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
[Fri Aug 20 16:24:14 2021] dlm: connection 5ef82293 got EOF from 172204615
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
[Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: dd99ffd16650
[Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
[Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
[Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
[Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
[Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G W 5.13.8-1-default #1 openSUSE Tumbleweed
[Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
[Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
[Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
[Fri Aug 20 16:24:14 2021] RSP: 0018:bd9a041ebd80 EFLAGS: 00010282
[Fri Aug 20 16:24:14 2021] RAX: 3cc9c100ec00 RBX: 00dc RCX: 0830
[Fri Aug 20 16:24:14 2021] RDX: RSI: 0f48 RDI: c06b4420
[Fri Aug 20 16:24:14 2021] RBP: a0d028423974 R08: 0001 R09: 0004
[Fri Aug 20 16:24:14 2021] R10: R11: R12: a0d028425000
[Fri Aug 20 16:24:14 2021] R13: 0a43a2f2 R14: a0d028425770 R15: 0a43a2f2
[Fri Aug 20 16:24:14 2021] FS: () GS:a0d03ed0() knlGS:
[Fri Aug 20 16:24:14 2021] CS: 0010 DS: ES: CR0: 80050033
[Fri Aug 20 16:24:14 2021] CR2: dd99ffd16650 CR3: 02696000 CR4: 000406e0
[Fri Aug 20 16:24:14 2021] Call Trace:
[Fri Aug 20 16:24:14 2021] dlm_receive_buffer+0x66/0x150 [dlm]
[Fri Aug 20 16:24:14 2021] dlm_process_incoming_buffer+0x38/0x90 [dlm]
[Fri Aug 20 16:24:14 2021] receive_from_sock+0xd4/0x1f0 [dlm]
[Fri Aug 20 16:24:14 2021] process_recv_sockets+0x1a/0x20 [dlm]
[Fri Aug 20 16:24:14 2021] process_one_work+0x1df/0x370
[Fri Aug 20 16:24:14 2021] worker_thread+0x50/0x400
[Fri Aug 20 16:24:14 2021] ? process_one_work+0x370/0x370
[Fri Aug 20 16:24:14 2021] kthread+0x127/0x150
[Fri Aug 20 16:24:14 2021] ? set_kthread_struct+0x40/0x40
[Fri Aug 20 16:24:14 2021] ret_from_fork+0x22/0x30
[Fri Aug 20 16:24:14 2021] Modules linked in: rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core ocfs2_stack_user ocfs2 ocfs2_nodemanager ocfs2_stackglue quota_tree dlm af_packet iscsi_ibft iscsi_boot_sysfs rfkill intel_rapl_msr hid_generic intel_rapl_common usbhid virtio_net pcspkr joydev net_failover virtio_balloon i2c_piix4 failover tiny_power_button button fuse configfs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ata_generic uhci_hcd ehci_pci ehci_hcd cirrus drm_kms_helper aesni_intel usbcore crypto_simd syscopyarea sysfillrect sysimgblt fb_sys_fops cec cryptd rc_core drm serio_raw i6300esb virtio_blk ata_piix floppy qemu_fw_cfg btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua virtio_rng
[Fri Aug 20 16:24:14 2021] CR2: dd99ffd16650
[Fri Aug 20 16:24:14 2021] ---[ end trace 2ddfa38b9d824d93 ]---
[Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
[Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
[Fri Aug 20 16:24:14 2021] RSP: 0018:bd9a041ebd80 EFLAGS: 00010282
[Fri Aug 20 16:24:14 2021] RAX: 3cc9c100ec00 RBX: 00dc RCX:
Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a dlm lock?
Hi David,

On 2021/8/13 1:45, David Teigland wrote:
> On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
>> In fact, I can reproduce this problem stably. I want to know whether
>> this error is expected, since there is no extreme pressure in the test.
>> Second, how should we handle these error cases? Call the dlm_lock
>> function again? The function may fail again, and that would lead to a
>> kernel soft lockup after multiple retries.
>
> What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to
> cancel an in-progress dlm_lock() request. Before the cancel completes
> (or the original request completes), ocfs2 calls dlm_lock() again on the
> same resource. This dlm_lock() returns -EBUSY because the previous
> request has not completed, either normally or by cancellation. This is
> expected.

Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
different nodes?

> A couple of options to try: wait for the original request to complete
> (normally or by cancellation) before calling dlm_lock() again, or retry
> dlm_lock() on -EBUSY.

If I retry dlm_lock() repeatedly, I just wonder whether this will lead to a
kernel soft lockup or waste lots of CPU. And if dlm_lock() returns -EAGAIN,
how should we handle that case? Retry repeatedly as well?

Thanks
Gang

> Dave
Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a dlm lock?
Hi Alexander,

On 2021/8/12 4:35, Alexander Aring wrote:
> Hi,
>
> On Wed, Aug 11, 2021 at 6:41 AM Gang He wrote:
>> Hello List,
>>
>> I am using kernel 5.13.4 (some older kernel versions have the same
>> problem). When node A had acquired a DLM (EX) lock and node B tried to
>> get the lock, node A got a BAST message and then downconverted the lock
>> to NL; the dlm_lock function failed with error -16. The failure did not
>> always happen, but in some cases I could encounter it. Why does the
>> dlm_lock function fail when downconverting a dlm lock? Are there any
>> documents describing these error cases? If the code ignores the
>> dlm_lock error on node A, node B will never get the dlm lock. How
>> should we handle such a situation? Call dlm_lock to downconvert the
>> lock again?
>
> What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user
> (libdlm)?

ocfs2 file system.

> I believe you are running into case [0]. Can you provide the
> corresponding log_debug() message? It's necessary to insert
> "log_debug=1" in your dlm.conf; it will then be reported at KERN_DEBUG
> in your kernel log.

[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 10 10c 2 0 M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap

The whole kernel log for this node is here: https://pastebin.com/FBn8Uwsu
The kernel logs from the other two nodes:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem stably. I want to know whether this
error is expected, since there is no extreme pressure in the test. Second,
how should we handle these error cases? Call the dlm_lock function again?
The function may fail again, and that would lead to a kernel soft lockup
after multiple retries.

Thanks
Gang

> Thanks.
> - Alex
>
> [0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
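[Editorial note] For reference, enabling the debug output Alexander asks for is a one-line dlm.conf change (a minimal sketch; the path and key are as described in the message above):

```
# /etc/dlm/dlm.conf
# Report dlm log_debug() messages at KERN_DEBUG in the kernel log.
log_debug=1
```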
[Cluster-devel] Why does the dlm_lock function fail when downconverting a dlm lock?
Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same
problem). When node A had acquired a DLM (EX) lock and node B tried to get
the lock, node A got a BAST message and then downconverted the lock to NL;
the dlm_lock function failed with error -16. The failure did not always
happen, but in some cases I could encounter it.

Why does the dlm_lock function fail when downconverting a dlm lock? Are
there any documents describing these error cases? If the code ignores the
dlm_lock error on node A, node B will never get the dlm lock. How should we
handle such a situation? Call the dlm_lock function to downconvert the lock
again?

Thanks
Gang
Re: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?
Hello Hayes,

> -----Original Message-----
> From: cluster-devel-boun...@redhat.com
> [mailto:cluster-devel-boun...@redhat.com] On Behalf Of Hayes, Bill
> Sent: 2019年10月11日 0:42
> To: ocfs2-de...@oss.oracle.com; cluster-devel@redhat.com
> Cc: Rocky (The good-looking one) Craig
> Subject: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?
>
> We have been experimenting with distributed file systems across multiple
> Linux instances connected to a shared block device. In our setup, the
> "disk" is not a legacy SAN or iSCSI. Instead it is a shared
> memory-semantic fabric that is being presented as a Linux block device.
>
> We have been working with both GFS2 and OCFS2 to evaluate their
> suitability for our shared memory configuration. Right now we have gotten
> both GFS2 and OCFS2 to work with the block driver, but each file system
> still does block copies. Our goal is to extend mmap() of the file
> system(s) to allow true zero-copy load/store access directly to the
> memory fabric. We believe adding DAX support to OCFS2 and/or GFS2 is an
> expedient path to use a block device that fronts our memory fabric with
> DAX.
>
> Based on the HW that OCFS2 and GFS2 were built for (iSCSI, FC, DRBD,
> etc.) there probably has been no reason to implement DAX to date. The
> advent of various memory-semantic fabrics (Gen-Z, NUMAlink, etc.) is
> driving our interest in extending OCFS2 and/or GFS2 to take advantage of
> DAX. We have two platforms set up, one based on actual hardware and
> another based on VMs, and we are eager to begin deeper work.
>
> Has there been any discussion or interest in DAX support in OCFS2?

No, but I think this is a very interesting topic/feature. I hope we can put
some effort into investigating how to make OCFS2 support DAX, since some
local file systems have supported this feature for a long time.

> Is there interest from the OCFS2 development community to see DAX support
> developed and put upstream?

From my personal view, it is very attractive. But we are also aware that
cluster file systems are usually based on a DLM, and the DLM instances
usually communicate with each other via the network. That means network
latency must be considered.

Thanks
Gang

> Has there been any discussion or interest in DAX support in GFS2?
> Is there interest from the GFS2 development community to see DAX support
> developed and put upstream?
>
> Regards,
> Bill
[Cluster-devel] [PATCH] dlm: remove O_NONBLOCK flag in sctp_connect_to_sock
We should remove the O_NONBLOCK flag when calling sock->ops->connect() in
the sctp_connect_to_sock() function. Why?

1. Up to now, the SCTP socket connect() function ignores the flags
argument, which means the O_NONBLOCK flag does not take effect; we should
remove it to avoid confusion (this part is not urgent).

2. Looking forward, there is a patch queued to fix this, after which the
flags argument will take effect:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/sctp?id=644fbdeacf1d3edd366e44b8ba214de9d1dd66a9
But the O_NONBLOCK flag will then make sock->ops->connect() return
immediately without any wait time, so the connection will not be
established, and the DLM kernel module will call sock->ops->connect() again
and again. The bad results are: CPU usage is almost 100%, a soft-lockup
report may even be triggered if the related configuration options are
enabled, and the DLM kernel module prints lots of messages like:

[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592

The upper application (e.g. the ocfs2 mount command) hangs at
new_lockspace(); the whole backtrace is as below:

tb0307-nd2:~ # cat /proc/2935/stack
[<0>] new_lockspace+0x957/0xac0 [dlm]
[<0>] dlm_new_lockspace+0xae/0x140 [dlm]
[<0>] user_cluster_connect+0xc3/0x3a0 [ocfs2_stack_user]
[<0>] ocfs2_cluster_connect+0x144/0x220 [ocfs2_stackglue]
[<0>] ocfs2_dlm_init+0x215/0x440 [ocfs2]
[<0>] ocfs2_fill_super+0xcb0/0x1290 [ocfs2]
[<0>] mount_bdev+0x173/0x1b0
[<0>] mount_fs+0x35/0x150
[<0>] vfs_kern_mount.part.23+0x54/0x100
[<0>] do_mount+0x59a/0xc40
[<0>] SyS_mount+0x80/0xd0
[<0>] do_syscall_64+0x76/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[<0>] 0x

So I think we should remove the O_NONBLOCK flag here, since the DLM kernel
module cannot handle a non-blocking socket in connect() properly.
Signed-off-by: Gang He
---
 fs/dlm/lowcomms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index d31e9abfb9f1..a5e4a221435c 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1092,7 +1092,7 @@ static void sctp_connect_to_sock(struct connection *con)
 	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
 			  sizeof(tv));
 	result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
-				    O_NONBLOCK);
+				    0);
 	memset(&tv, 0, sizeof(tv));
 	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
 			  sizeof(tv));
-- 
2.12.3
[Cluster-devel] [PATCH] dlm: fix a clerical error when setting SCTP_NODELAY
There is a clerical error when turning off Nagle's algorithm in the
sctp_connect_to_sock() function: the TCP-level option pair is used on an
SCTP socket, so turning off Nagle's algorithm fails. With this correction,
DLM performance improves noticeably when using the SCTP protocol.

Signed-off-by: Gang He <g...@suse.com>
Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 fs/dlm/lowcomms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 5243989..8151252 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1080,7 +1080,7 @@ static void sctp_connect_to_sock(struct connection *con)
 	log_print("connecting to %d", con->nodeid);
 
 	/* Turn off Nagle's algorithm */
-	kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *)&one,
+	kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
 			  sizeof(one));
 
 	result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
-- 
1.8.5.6
[Cluster-devel] [PATCH] dlm: make sctp_connect_to_sock() return in specified time
When the user sets up a two-ring cluster, the DLM kernel module
automatically selects the SCTP protocol to communicate between nodes. There
will be an approximately 5-minute hang in the DLM kernel module if one ring
is broken before switching to the other ring, and this can affect the
dependent upper applications, e.g. ocfs2, gfs2, clvm and clustered MD.
Unfortunately, with a two-ring cluster we cannot explicitly select TCP as
the DLM communication protocol, since the DLM kernel module only supports
SCTP for multi-ring clusters.

Based on my investigation, the time is spent in sock->ops->connect() before
it returns ETIMEDOUT (-110), since the O_NONBLOCK argument to connect()
does not work here. So we should make sock->ops->connect() return within a
specified time by setting the socket's SO_SNDTIMEO attribute.

Signed-off-by: Gang He <g...@suse.com>
---
 fs/dlm/lowcomms.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 5243989..b786acc 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1037,6 +1037,7 @@ static void sctp_connect_to_sock(struct connection *con)
 	int result;
 	int addr_len;
 	struct socket *sock;
+	struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
 
 	if (con->nodeid == 0) {
 		log_print("attempt to connect sock 0 foiled");
@@ -1083,8 +1084,19 @@ static void sctp_connect_to_sock(struct connection *con)
 	kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *)&one,
 			  sizeof(one));
 
+	/*
+	 * Make sock->ops->connect() function return in specified time,
+	 * since O_NONBLOCK argument in connect() function does not work here,
+	 * then, we should restore the default value of this attribute.
+	 */
+	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
+			  sizeof(tv));
 	result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
 				    O_NONBLOCK);
+	memset(&tv, 0, sizeof(tv));
+	kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, (char *)&tv,
+			  sizeof(tv));
+
 	if (result == -EINPROGRESS)
 		result = 0;
 	if (result == 0)
-- 
1.8.5.6
Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental
Hi Steven and David,

> Hi,
>
> On 09/04/18 06:02, Gang He wrote:
>> Hello David,
>>
>> If the user sets "protocol=tcp" in the configuration file
>> /etc/dlm/dlm.conf under a two-ring cluster environment, the DLM kernel
>> module will not work, with the below error messages:
>> [   43.696924] DLM installed
>> [  149.552039] ocfs2: Registered cluster interface user
>> [  149.559579] dlm: TCP protocol can't handle multi-homed hosts, try SCTP   <<== here, failed
>> [  149.559589] dlm: cannot start dlm lowcomms -22
>> [  149.559612] (mount.ocfs2,2593,3):ocfs2_dlm_init:3120 ERROR: status = -22
>> [  149.559629] (mount.ocfs2,2593,3):ocfs2_mount_volume:1845 ERROR: status = -22
>>
>> Then, could we modify the code to let this case still work by using
>> only one ring address? Or is the code written that way on purpose?
>> In lowcomms.c:
>> 1358 static int tcp_listen_for_all(void)
>> 1359 {
>> 1360         struct socket *sock = NULL;
>> 1361         struct connection *con = nodeid2con(0, GFP_NOFS);
>> 1362         int result = -EINVAL;
>> 1363
>> 1364         if (!con)
>> 1365                 return -ENOMEM;
>> 1366
>> 1367         /* We don't support multi-homed hosts */
>> 1368         if (dlm_local_addr[1] != NULL) {   <<== here, could we get rid of this limitation?
>> 1369                 log_print("TCP protocol can't handle multi-homed hosts, "
>> 1370                           "try SCTP");
>> 1371                 return -EINVAL;
>> 1372         }
>> 1373
>> 1374         log_print("Using TCP for communications");
>> 1375
>> 1376         sock = tcp_create_listen_sock(con, dlm_local_addr[0]);
>> 1377         if (sock) {
>> 1378                 add_sock(sock, con);
>> 1379                 result = 0;
>> 1380         }
>> 1381         else {
>> 1382                 result = -EADDRINUSE;
>> 1383         }
>>
>> Thanks
>> Gang
>
> There is already a patch set to allow multi-homing for TCP. Mark and
> Dave can comment on the current status and how far from merging it
> currently is,

Thanks for your update on this problem; hopefully we can see the related
patches in the Linus git tree soon.

Thanks
Gang

> Steve.
Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental
Hello David,

If the user sets "protocol=tcp" in the configuration file /etc/dlm/dlm.conf
under a two-ring cluster environment, the DLM kernel module will not work,
with the below error messages:

[   43.696924] DLM installed
[  149.552039] ocfs2: Registered cluster interface user
[  149.559579] dlm: TCP protocol can't handle multi-homed hosts, try SCTP   <<== here, failed
[  149.559589] dlm: cannot start dlm lowcomms -22
[  149.559612] (mount.ocfs2,2593,3):ocfs2_dlm_init:3120 ERROR: status = -22
[  149.559629] (mount.ocfs2,2593,3):ocfs2_mount_volume:1845 ERROR: status = -22

Then, could we modify the code to let this case still work by using only
one ring address? Or is the code written that way on purpose?
In lowcomms.c:

1358 static int tcp_listen_for_all(void)
1359 {
1360         struct socket *sock = NULL;
1361         struct connection *con = nodeid2con(0, GFP_NOFS);
1362         int result = -EINVAL;
1363
1364         if (!con)
1365                 return -ENOMEM;
1366
1367         /* We don't support multi-homed hosts */
1368         if (dlm_local_addr[1] != NULL) {   <<== here, could we get rid of this limitation?
1369                 log_print("TCP protocol can't handle multi-homed hosts, "
1370                           "try SCTP");
1371                 return -EINVAL;
1372         }
1373
1374         log_print("Using TCP for communications");
1375
1376         sock = tcp_create_listen_sock(con, dlm_local_addr[0]);
1377         if (sock) {
1378                 add_sock(sock, con);
1379                 result = 0;
1380         }
1381         else {
1382                 result = -EADDRINUSE;
1383         }

Thanks
Gang

>>>
> On Mon, Apr 02, 2018 at 08:01:24PM -0600, Gang He wrote:
>> OK, I got your point.
>> But could we have an appropriate way to let the users know the SCTP
>> protocol status?
>
> I think this is a case where suse/rh/etc need to have their own
> distro-specific approaches for specifying the usage parameters that they
> have tested and found to be supportable. Other companies have previously
> found their specific use of SCTP to be acceptable. RH does not properly
> support dlm+SCTP for similar reasons as you've found, although I've more
> recently encouraged customers to try dlm+SCTP with a single path in order
> to debug or diagnose potential networking issues.
Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental
Hi David,

>>>
> On Thu, Mar 22, 2018 at 10:27:56PM -0600, Gang He wrote:
>> Hello David,
>>
>> Do you agree to add this prompt for the user?
>> Since sometimes customers attempted to set up the SCTP protocol with
>> two rings but could not get the expected result, this may bring some
>> concerns to the customer about DLM quality.
>
> I don't think the kernel message is a good way to communicate this to
> users.
> Dave

OK, I got your point.
But could we have an appropriate way to let the users know the SCTP
protocol status?

Thanks
Gang

>> > As you know, DLM module can use TCP or SCTP protocols to
>> > communicate among the cluster.
>> > But, according to our testing, the SCTP protocol is still considered
>> > experimental, since not all aspects are working correctly and
>> > it is not fully tested.
>> > e.g. an SCTP connection channel switch needs an about 5-minute hang
>> > in case one connection (ring) is broken.
>> > Then, I suggest adding a kernel print which prompts the user that the
>> > SCTP protocol for DLM should be considered experimental; it is not
>> > recommended in production environments.
>> >
>> > Signed-off-by: Gang He <g...@suse.com>
>> > ---
>> >  fs/dlm/lowcomms.c | 1 +
>> >  1 file changed, 1 insertion(+)
>> >
>> > diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
>> > index cff79ea..18fd85d 100644
>> > --- a/fs/dlm/lowcomms.c
>> > +++ b/fs/dlm/lowcomms.c
>> > @@ -1307,6 +1307,7 @@ static int sctp_listen_for_all(void)
>> >  		return -ENOMEM;
>> >
>> >  	log_print("Using SCTP for communications");
>> > +	log_print("SCTP protocol is experimental, use at your own risk");
>> >
>> >  	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
>> >  				  SOCK_STREAM, IPPROTO_SCTP, &sock);
>> > --
>> > 1.8.5.6
Re: [Cluster-devel] [PATCH] dlm: prompt the user SCTP is experimental
Hello David,

Do you agree to add this prompt for the user?
Since sometimes customers attempted to set up the SCTP protocol with two
rings but could not get the expected result, this may bring some concerns
to the customer about DLM quality.

Thanks
Gang

>>>
> As you know, DLM module can use TCP or SCTP protocols to
> communicate among the cluster.
> But, according to our testing, the SCTP protocol is still considered
> experimental, since not all aspects are working correctly and
> it is not fully tested.
> e.g. an SCTP connection channel switch needs an about 5-minute hang in
> case one connection (ring) is broken.
> Then, I suggest adding a kernel print which prompts the user that the
> SCTP protocol for DLM should be considered experimental; it is not
> recommended in production environments.
>
> Signed-off-by: Gang He <g...@suse.com>
> ---
>  fs/dlm/lowcomms.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
> index cff79ea..18fd85d 100644
> --- a/fs/dlm/lowcomms.c
> +++ b/fs/dlm/lowcomms.c
> @@ -1307,6 +1307,7 @@ static int sctp_listen_for_all(void)
>  		return -ENOMEM;
>
>  	log_print("Using SCTP for communications");
> +	log_print("SCTP protocol is experimental, use at your own risk");
>
>  	result = sock_create_kern(&init_net, dlm_local_addr[0]->ss_family,
>  				  SOCK_STREAM, IPPROTO_SCTP, &sock);
> --
> 1.8.5.6
Re: [Cluster-devel] [ClusterLabs] DLM connection channel switch takes too long (> 5 mins)
Hi Feldhost, I use active rrp_mode in corosync.conf and reboot the cluster to let the configuration effective. But, the about 5 mins hang in new_lockspace() function is still here. Thanks Gang >>> > Hi, so try to use active mode. > > https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_installatio > n_terms.html > > That fixes I saw in 4.14.* > >> On 8 Mar 2018, at 09:12, Gang He <g...@suse.com> wrote: >> >> Hi Feldhost, >> >> >>>>> >>> Hello Gang He, >>> >>> which type of corosync rrp_mode you use? Passive or Active? >> clvm1:/etc/corosync # cat corosync.conf | grep rrp_mode >>rrp_mode: passive >> >> Did you try test both? >> No, only this mode. >> Also, what kernel version you use? I see some SCTP fixes in latest kernels. >> clvm1:/etc/corosync # uname -r >> 4.4.114-94.11-default >> It looks that sock->ops->connect() function is blocked for too long time >> before > return, under broken network situation. >> In normal network, sock->ops->connect() function returns very quickly. >> >> Thanks >> Gang >> >>> >>>> On 8 Mar 2018, at 08:52, Gang He <g...@suse.com> wrote: >>>> >>>> Hello list and David Teigland, >>>> >>>> I got a problem under a two rings cluster, the problem can be reproduced >>> with the below steps. >>>> 1) setup a two rings cluster with two nodes. >>>> e.g. >>>> clvm1(nodeid 172204569) addr_list eth0 10.67.162.25 eth1 192.168.152.240 >>>> clvm2(nodeid 172204570) addr_list eth0 10.67.162.26 eth1 192.168.152.103 >>>> >>>> 2) the whole cluster works well, then I put eth0 down on node clvm2, and >>> restart pacemaker service on that node. >>>> ifconfig eth0 down >>>> rcpacemaker restart >>>> >>>> 3) the whole cluster still work well (that means corosync is very smooth >>>> to >>> switch to the other ring). 
>>>> Then, I can mount ocfs2 file system on node clvm2 quickly with the command >>>> mount /dev/sda /mnt/ocfs2 >>>> >>>> 4) Next, I do the same mount on node clvm1, the mount command will be >>>> hanged > >>> for about 5 mins, and finally the mount command is done. >>>> But, if we setup a ocfs2 file system resource in pacemaker, >>>> the pacemaker resource agent will consider ocfs2 file system resource >>> startup failure before this command returns, >>>> the pacemaker will fence node clvm1. >>>> This problem is impacting our customer's estimate, since they think the >>>> two >>> rings can be switched smoothly. >>>> >>>> According to this problem, I can see the mount command is hanged with the >>> below back trace, >>>> clvm1:/ # cat /proc/6688/stack >>>> [] new_lockspace+0x92d/0xa70 [dlm] >>>> [] dlm_new_lockspace+0x69/0x160 [dlm] >>>> [] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user] >>>> [] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue] >>>> [] ocfs2_dlm_init+0x31c/0x570 [ocfs2] >>>> [] ocfs2_fill_super+0xb33/0x1200 [ocfs2] >>>> [] mount_bdev+0x1a0/0x1e0 >>>> [] mount_fs+0x3a/0x170 >>>> [] vfs_kern_mount+0x62/0x110 >>>> [] do_mount+0x213/0xcd0 >>>> [] SyS_mount+0x85/0xd0 >>>> [] entry_SYSCALL_64_fastpath+0x1e/0xb6 >>>> [] 0x >>>> >>>> The root cause is in sctp_connect_to_sock() function in lowcomms.c, >>>> 1075 >>>> 1076 log_print("connecting to %d", con->nodeid); >>>> 1077 >>>> 1078 /* Turn off Nagle's algorithm */ >>>> 1079 kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *), >>>> 1080 sizeof(one)); >>>> 1081 >>>> 1082 result = sock->ops->connect(sock, (struct sockaddr *), >>> addr_len, >>>> 1083O_NONBLOCK); <<= here, this >>>> invoking >>> will cost > 5 mins before return ETIMEDOUT(-110). >>>> 1084 printk(KERN_ERR "sctp_connect_to_sock connect: %d\n", result); >>>> 1085 >>>> 1086 if (result == -EINPROGRESS) >>>> 1087 result = 0; >>>> 1088 if (result == 0) >>>> 1089 goto out; >>>> >>>> Then, I want to know if this problem was found/fixed before? 
>>>> It looks like DLM cannot switch to the second ring quickly, and this prevents the applications above (e.g. CLVM, ocfs2) from creating a new lockspace during their start-up.
>>>>
>>>> Thanks
>>>> Gang
>>>>
>>>>
>>>> ___
>>>> Users mailing list: us...@clusterlabs.org
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
Re: [Cluster-devel] [ClusterLabs] DLM connection channel switch takes too long (> 5 mins)
Hi Feldhost,

> Hello Gang He,
>
> which type of corosync rrp_mode you use? Passive or Active?
clvm1:/etc/corosync # cat corosync.conf | grep rrp_mode
    rrp_mode: passive

> Did you try test both?
No, only this mode.
> Also, what kernel version you use? I see some SCTP fixes in latest kernels.
clvm1:/etc/corosync # uname -r
4.4.114-94.11-default
It looks like the sock->ops->connect() function blocks for too long before returning when the network is broken.
On a healthy network, sock->ops->connect() returns very quickly.

Thanks
Gang

>> On 8 Mar 2018, at 08:52, Gang He <g...@suse.com> wrote:
>>
>> Hello list and David Teigland,
>>
>> I hit a problem on a two-ring cluster; it can be reproduced with the steps below.
>> 1) Set up a two-ring cluster with two nodes, e.g.
>> clvm1 (nodeid 172204569) addr_list eth0 10.67.162.25 eth1 192.168.152.240
>> clvm2 (nodeid 172204570) addr_list eth0 10.67.162.26 eth1 192.168.152.103
>>
>> 2) The whole cluster works well; then I take eth0 down on node clvm2 and restart the pacemaker service on that node.
>> ifconfig eth0 down
>> rcpacemaker restart
>>
>> 3) The whole cluster still works well (that means corosync switches over to the other ring very smoothly).
>> Then, I can mount the ocfs2 file system on node clvm2 quickly with the command
>> mount /dev/sda /mnt/ocfs2
>>
>> 4) Next, I do the same mount on node clvm1; the mount command hangs for about 5 minutes before it finally completes.
>> But if we set up an ocfs2 file system resource in pacemaker,
>> the pacemaker resource agent considers the ocfs2 resource start-up failed before this command returns,
>> and pacemaker fences node clvm1.
>> This problem affects our customer's evaluation, since they expect the two rings to switch over smoothly.
>>
>> For this problem, I can see the mount command hangs with the back trace below,
>> clvm1:/ # cat /proc/6688/stack
>> [] new_lockspace+0x92d/0xa70 [dlm]
>> [] dlm_new_lockspace+0x69/0x160 [dlm]
>> [] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
>> [] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
>> [] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>> [] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>> [] mount_bdev+0x1a0/0x1e0
>> [] mount_fs+0x3a/0x170
>> [] vfs_kern_mount+0x62/0x110
>> [] do_mount+0x213/0xcd0
>> [] SyS_mount+0x85/0xd0
>> [] entry_SYSCALL_64_fastpath+0x1e/0xb6
>> [] 0x
>>
>> The root cause is in the sctp_connect_to_sock() function in lowcomms.c:
>> 1075
>> 1076         log_print("connecting to %d", con->nodeid);
>> 1077
>> 1078         /* Turn off Nagle's algorithm */
>> 1079         kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY, (char *)&one,
>> 1080                           sizeof(one));
>> 1081
>> 1082         result = sock->ops->connect(sock, (struct sockaddr *)&daddr,
>> 1083                                     addr_len, O_NONBLOCK);  <<= here, this call takes > 5 mins before returning ETIMEDOUT (-110).
>> 1084         printk(KERN_ERR "sctp_connect_to_sock connect: %d\n", result);
>> 1085
>> 1086         if (result == -EINPROGRESS)
>> 1087                 result = 0;
>> 1088         if (result == 0)
>> 1089                 goto out;
>>
>> Then, I want to know if this problem was found/fixed before?
>> It looks like DLM cannot switch to the second ring quickly, and this prevents the applications above (e.g. CLVM, ocfs2) from creating a new lockspace during their start-up.
>>
>> Thanks
>> Gang
>>
>>
>> ___
>> Users mailing list: us...@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
[Cluster-devel] [PATCH v2] dlm: Make cluster name mismatch error message clearer
This change makes the error message clearer. Upper-layer applications (e.g. ocfs2) invoke dlm_new_lockspace() to create a new lockspace, passing in a cluster name. When the two cluster names do not match, dlm_new_lockspace() returns a failure, and the user is easily confused because the current one-line error message is not obvious enough.

Signed-off-by: Gang He <g...@suse.com>
---
 fs/dlm/lockspace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index 91592b7..b03d808 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -455,7 +455,8 @@ static int new_lockspace(const char *name, const char *cluster,
 
 	if (dlm_config.ci_recover_callbacks && cluster &&
 	    strncmp(cluster, dlm_config.ci_cluster_name, DLM_LOCKSPACE_LEN)) {
-		log_print("dlm cluster name %s mismatch %s",
+		log_print("dlm cluster name '%s' does not match "
+			  "the application cluster name '%s'",
 			  dlm_config.ci_cluster_name, cluster);
 		error = -EBADR;
 		goto out;
-- 
1.8.5.6
[Cluster-devel] [PATCH] dlm: Make cluster name mismatch error message clearer
This change makes the error message clearer. Upper-layer applications (e.g. ocfs2) invoke dlm_new_lockspace() to create a new lockspace, passing in a cluster name. When the two cluster names do not match, dlm_new_lockspace() returns a failure, and the user is easily confused because the current one-line error message is not obvious enough.

Signed-off-by: Gang He <g...@suse.com>
---
 fs/dlm/lockspace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index 91592b7..b03d808 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -455,7 +455,8 @@ static int new_lockspace(const char *name, const char *cluster,
 
 	if (dlm_config.ci_recover_callbacks && cluster &&
 	    strncmp(cluster, dlm_config.ci_cluster_name, DLM_LOCKSPACE_LEN)) {
-		log_print("dlm cluster name %s mismatch %s",
+		log_print("dlm configured cluster name '%s' does not match "
+			  "the passed cluster name '%s'",
 			  dlm_config.ci_cluster_name, cluster);
 		error = -EBADR;
 		goto out;
-- 
1.8.5.6
Re: [Cluster-devel] GFS2 file system does not invalidate page cache after direct IO write
Hello Andreas,

> Gang,
>
> On Thu, May 4, 2017 at 5:33 AM, Gang He <g...@suse.com> wrote:
>> Hello Guys,
>>
>> I found an interesting thing on the GFS2 file system: after I do a direct IO write covering a whole file, I still see some page cache pages attached to the inode.
>> It looks like this GFS2 behavior does not follow the expected direct IO semantics; I just want to know whether this is a known issue or something we can fix.
>> By the way, I ran the same test on the EXT4 and OCFS2 file systems, and their results look OK.
>> I will paste my test command lines and outputs below.
>>
>> For the EXT4 file system,
>> tb-nd1:/mnt/ext4 # rm -rf f3
>> tb-nd1:/mnt/ext4 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
>> 4+0 records in
>> 4+0 records out
>> 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0393563 s, 107 MB/s
>> tb-nd1:/mnt/ext4 # vmtouch -v f3
>> f3
>> [                ] 0/1024
>>
>>            Files: 1
>>      Directories: 0
>>   Resident Pages: 0/1024  0/4M  0%
>>          Elapsed: 0.000424 seconds
>> tb-nd1:/mnt/ext4 #
>>
>> For the OCFS2 file system,
>> tb-nd1:/mnt/ocfs2 # rm -rf f3
>> tb-nd1:/mnt/ocfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
>> 4+0 records in
>> 4+0 records out
>> 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0592058 s, 70.8 MB/s
>> tb-nd1:/mnt/ocfs2 # vmtouch -v f3
>> f3
>> [                ] 0/1024
>>
>>            Files: 1
>>      Directories: 0
>>   Resident Pages: 0/1024  0/4M  0%
>>          Elapsed: 0.000226 seconds
>>
>> For the GFS2 file system,
>> tb-nd1:/mnt/gfs2 # rm -rf f3
>> tb-nd1:/mnt/gfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
>> 4+0 records in
>> 4+0 records out
>> 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0579509 s, 72.4 MB/s
>> tb-nd1:/mnt/gfs2 # vmtouch -v f3
>> f3
>> [ oo  oOo ] 48/1024
>
> I cannot reproduce, at least not so easily. What kernel version is
> this? If it's not a mainline kernel, can you reproduce on mainline?
I can always reproduce it. I am using kernel version 4.11.0-rc4-2-default; although it is not the latest, it is new enough.
By the way, I added some printk calls in the GFS2 and OCFS2 kernel modules, and I found that GFS2 direct IO always falls back to buffered IO; I am not sure whether this behavior is by design.
Of course, even when GFS2 falls back to buffered IO, the code should still make sure the related page cache is invalidated, but the test result is not as expected; I need to look at the code more deeply.
The printk outputs look like,

[  198.176774] gfs2_file_write_iter: enter ino 132419 0 - 1048576
[  198.176785] gfs2_direct_IO: enter ino 132419 pages 0 0 - 1048576
[  198.176787] gfs2_direct_IO: exit ino 132419 - (0)  <<== here, gfs2_direct_IO always returns 0 and then falls back to buffered IO; is this behavior by design?
[  198.184640] gfs2_file_write_iter: exit ino 132419 - (1048576)  <<== write_iter appears to return the right byte count.
[  198.189151] gfs2_file_write_iter: enter ino 132419 1048576 - 1048576
[  198.189163] gfs2_direct_IO: enter ino 132419 pages 8 1048576 - 1048576  <<== here, the inode's page count is greater than zero.
[  198.189165] gfs2_direct_IO: exit ino 132419 - (0)
[  198.195901] gfs2_file_write_iter: exit ino 132419 - (1048576)

But for OCFS2,

[  120.331053] ocfs2_file_write_iter: enter ino 297475 0 - 1048576
[  120.331065] ocfs2_direct_IO: enter ino 297475 pages 0 0 - 1048576
[  120.343129] ocfs2_direct_IO: exit ino 297475 (1048576)  <<== here, ocfs2_direct_IO returns the right byte count.
[  120.343132] ocfs2_file_write_iter: exit ino 297475 - (1048576)
[  120.347705] ocfs2_file_write_iter: enter ino 297475 1048576 - 1048576
[  120.347713] ocfs2_direct_IO: enter ino 297475 pages 0 1048576 - 1048576  <<== here, the inode's page count is always zero.
[  120.354096] ocfs2_direct_IO: exit ino 297475 (1048576)
[  120.354099] ocfs2_file_write_iter: exit ino 297475 - (1048576)

Thanks
Gang

> Thanks,
> Andreas
[Cluster-devel] GFS2 file system does not invalidate page cache after direct IO write
Hello Guys,

I found an interesting thing on the GFS2 file system: after I do a direct IO write covering a whole file, I still see some page cache pages attached to the inode.
It looks like this GFS2 behavior does not follow the expected direct IO semantics; I just want to know whether this is a known issue or something we can fix.
By the way, I ran the same test on the EXT4 and OCFS2 file systems, and their results look OK.
I will paste my test command lines and outputs below.

For the EXT4 file system,
tb-nd1:/mnt/ext4 # rm -rf f3
tb-nd1:/mnt/ext4 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0393563 s, 107 MB/s
tb-nd1:/mnt/ext4 # vmtouch -v f3
f3
[                ] 0/1024

           Files: 1
     Directories: 0
  Resident Pages: 0/1024  0/4M  0%
         Elapsed: 0.000424 seconds
tb-nd1:/mnt/ext4 #

For the OCFS2 file system,
tb-nd1:/mnt/ocfs2 # rm -rf f3
tb-nd1:/mnt/ocfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0592058 s, 70.8 MB/s
tb-nd1:/mnt/ocfs2 # vmtouch -v f3
f3
[                ] 0/1024

           Files: 1
     Directories: 0
  Resident Pages: 0/1024  0/4M  0%
         Elapsed: 0.000226 seconds

For the GFS2 file system,
tb-nd1:/mnt/gfs2 # rm -rf f3
tb-nd1:/mnt/gfs2 # dd if=/dev/urandom of=./f3 bs=1M count=4 oflag=direct
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0579509 s, 72.4 MB/s
tb-nd1:/mnt/gfs2 # vmtouch -v f3
f3
[ oo  oOo ] 48/1024

           Files: 1
     Directories: 0
  Resident Pages: 48/1024  192K/4M  4.69%
         Elapsed: 0.000287 seconds

For the vmtouch tool, you can download its source code from https://github.com/hoytech/vmtouch
I also printk'd the inode's address_space after a full-file direct IO write in kernel space; the nrpages value in the inode's address_space is always greater than zero.

Thanks
Gang
[Cluster-devel] inconsistent dlm_new_lockspace LVB_LEN size from ocfs2 user-space tool and ocfs2 kernel module
Hello Guys,

Here is an inconsistent LVB_LEN size problem when creating a new lockspace from a user-space tool (e.g. fsck.ocfs2) versus from the kernel module (e.g. ocfs2/stack_user.c).
From the user-space tool, the LVB size is DLM_USER_LVB_LEN (32 bytes, defined in include/linux/dlm_device.h).
From the kernel module, the LVB size is DLM_LVB_LEN (64 bytes).
Why did we design it like this? Looking at the GFS2 kernel module code, it uses 32 bytes as the LVB_LEN size, the same as the DLM_USER_LVB_LEN macro definition.

Now we have encountered a customer issue: the user ran fsck on an ocfs2 file system from one node, but it aborted without releasing its lockspace (32 bytes); then the user mounted the file system. The kernel module reused the existing lockspace instead of creating a new lockspace with a 64-byte LVB_LEN.
The bad result was that the user could no longer mount this file system from the other nodes. The error messages look like,

Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.430947] dlm: 032F55597DEA4A61AB065568F964174D: config mismatch: 64,0 nodeid 177127961: 32,0
Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.433267] (mount.ocfs2,26981,46):ocfs2_dlm_init:2995 ERROR: status = -71
Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.433325] (mount.ocfs2,26981,46):ocfs2_mount_volume:1881 ERROR: status = -71
Apr 26 16:29:16 mapkhpch1bl02 kernel: [ 3730.433376] (mount.ocfs2,26981,46):ocfs2_fill_super:1236 ERROR: status = -71
Apr 26 16:29:16 mapkhpch1bl02 Filesystem(MITC_Pool1)[26912]: ERROR: Couldn't mount filesystem /dev/disk/by-id/scsi-3600507640081010d5082 on /MITC_Pool1

Of course, the urgent workaround is easy: we can reboot all the nodes and then mount the file system again. But I want to know whether there were reasons behind this design; otherwise, I would like to see whether we can use the same size in user space and in the kernel module.

Thanks
Gang