Re: [Gluster-devel] [Gluster-infra] NetBSD regression fixes
Emmanuel Dreyfus wrote:
> But I just realized the change is wrong, since running tests the "new way"
> stops on the first failed test. My change just retries the failed test and
> considers the regression run to be good on success, without running the
> next tests.
>
> I will post an update shortly.

Done:
http://review.gluster.org/13245
http://review.gluster.org/13247

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-infra] NetBSD regression fixes
Niels de Vos wrote:
> > 2) Spurious failures
> > I added a retry-failed-test-once feature so that we get fewer regression
> > failures because of spurious failures. It is not used right now because
> > it does not play nicely with the bad tests blacklist.
> >
> > This will be fixed by these changes:
> > http://review.gluster.org/13245
> > http://review.gluster.org/13247
> >
> > I have been looping failure-free regression for a while with that trick.
>
> Nice, thanks for these improvements!

But I just realized the change is wrong, since running tests the "new way"
stops on the first failed test. My change just retries the failed test and
considers the regression run to be good on success, without running the
next tests.

I will post an update shortly.

> Could you send a pull request for the regression.sh script on
> https://github.com/gluster/glusterfs-patch-acceptance-tests/ ? Or, if
> you don't use GitHub, send the patch by email and we'll take care of
> pushing it for you.

Sure, but let me settle on something that works first.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
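For readers following along: the actual patches are in the Gerrit changes above, but a minimal sketch of a retry-and-continue loop of this kind (hypothetical test names and runner, not the real regression.sh) could look like this. The two stand-in `.t` scripts only simulate a flaky and a passing test:

```shell
#!/bin/sh
# Hypothetical sketch: retry a failed test once, then CONTINUE with the
# remaining tests instead of stopping the run at the first failure.

workdir=$(mktemp -d)

# Stand-in tests: "flaky.t" fails on its first run only, "good.t" passes.
cat > "$workdir/flaky.t" <<'EOF'
if [ ! -f "$(dirname "$0")/marker" ]; then
    touch "$(dirname "$0")/marker"
    exit 1
fi
exit 0
EOF
cat > "$workdir/good.t" <<'EOF'
exit 0
EOF

failed=""
for t in "$workdir"/*.t; do
    if ! sh "$t"; then
        echo "Retrying failed test $(basename "$t") once"
        sh "$t" || failed="$failed $(basename "$t")"
    fi
done

rm -rf "$workdir"
if [ -n "$failed" ]; then
    echo "Tests still failing after retry:$failed"
    exit 1
fi
echo "All tests passed (possibly after one retry)"
```

The key point matching the discussion above: a test that passes on retry does not mark the whole run bad, and a failed test never aborts the loop early.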
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
Wrong assumption: rsync hung again.

On Saturday, 16 January 2016, 22:53:04 EET Oleksandr Natalenko wrote:
> One possible reason:
>
> cluster.lookup-optimize: on
> cluster.readdir-optimize: on
>
> I've disabled both optimizations, and at least as of now rsync still does
> its job with no issues. I would like to find out which option causes such
> behavior and why. Will test more.
>
> On Friday, 15 January 2016, 16:09:51 EET Oleksandr Natalenko wrote:
> > Another observation: if rsyncing is resumed after a hang, rsync itself
> > hangs a lot faster, because it stats the already-copied files. So the
> > reason may be not the writing itself, but the massive stat load on the
> > GlusterFS volume as well.
> >
> > On 15.01.2016 09:40, Oleksandr Natalenko wrote:
> > > While rsyncing millions of files from an ordinary partition to a
> > > GlusterFS volume, just after approximately the first 2 million files
> > > rsync hangs, and the following info appears in dmesg:
> > >
> > > ===
> > > [17075038.924481] INFO: task rsync:10310 blocked for more than 120 seconds.
> > > [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [17075038.940748] rsync D 88207fc13680 0 10310 10309 0x0080
> > > [17075038.940752] 8809c578be18 0086 8809c578bfd8 00013680
> > > [17075038.940756] 8809c578bfd8 00013680 880310cbe660 881159d16a30
> > > [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 88087d553980
> > > [17075038.940762] Call Trace:
> > > [17075038.940770] [] schedule+0x29/0x70
> > > [17075038.940797] [] __fuse_request_send+0x13d/0x2c0 [fuse]
> > > [17075038.940801] [] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
> > > [17075038.940805] [] ? wake_up_bit+0x30/0x30
> > > [17075038.940809] [] fuse_request_send+0x12/0x20 [fuse]
> > > [17075038.940813] [] fuse_flush+0xff/0x150 [fuse]
> > > [17075038.940817] [] filp_close+0x34/0x80
> > > [17075038.940821] [] __close_fd+0x78/0xa0
> > > [17075038.940824] [] SyS_close+0x23/0x50
> > > [17075038.940828] [] system_call_fastpath+0x16/0x1b
> > > ===
> > >
> > > rsync blocks in D state, and to kill it I have to do umount --lazy on
> > > the GlusterFS mountpoint and then kill the corresponding client
> > > glusterfs process. Then rsync exits.
> > >
> > > Here is the GlusterFS volume info:
> > >
> > > ===
> > > Volume Name: asterisk_records
> > > Type: Distributed-Replicate
> > > Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4
> > > Status: Started
> > > Number of Bricks: 3 x 2 = 6
> > > Transport-type: tcp
> > > Bricks:
> > > Brick1: server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/asterisk/records
> > > Brick2: server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/asterisk/records
> > > Brick3: server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/asterisk/records
> > > Brick4: server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/asterisk/records
> > > Brick5: server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/asterisk/records
> > > Brick6: server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/asterisk/records
> > > Options Reconfigured:
> > > cluster.lookup-optimize: on
> > > cluster.readdir-optimize: on
> > > client.event-threads: 2
> > > network.inode-lru-limit: 4096
> > > server.event-threads: 4
> > > performance.client-io-threads: on
> > > storage.linux-aio: on
> > > performance.write-behind-window-size: 4194304
> > > performance.stat-prefetch: on
> > > performance.quick-read: on
> > > performance.read-ahead: on
> > > performance.flush-behind: on
> > > performance.write-behind: on
> > > performance.io-thread-count: 2
> > > performance.cache-max-file-size: 1048576
> > > performance.cache-size: 33554432
> > > features.cache-invalidation: on
> > > performance.readdir-ahead: on
> > > ===
> > >
> > > The issue reproduces each time I rsync such an amount of files.
> > >
> > > How could I debug this issue better?
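The thread does not show what diagnostics were collected; a generic way to gather more data the next time rsync hangs is to list D-state (uninterruptible) tasks with their kernel stacks, and to trigger GlusterFS statedumps. The volume name below comes from the thread; the statedump commands (`gluster volume statedump` on the server side, SIGUSR1 to the client `glusterfs` process) are standard GlusterFS facilities, but process-match patterns and dump paths may differ per setup:

```shell
#!/bin/sh
# Sketch: collect diagnostics while rsync is hung (run as root for stacks).

# 1) Kernel side: which tasks sit in uninterruptible (D) sleep, and where?
ps axo pid=,stat=,comm= | awk '$2 ~ /^D/' | while read -r pid stat comm; do
    echo "== $comm (pid $pid, stat $stat) =="
    cat "/proc/$pid/stack" 2>/dev/null || true   # Linux-specific, needs root
done

# 2) GlusterFS side (commented out so the sketch runs anywhere):
# gluster volume statedump asterisk_records        # brick-side state dump
# kill -USR1 "$(pgrep -f 'glusterfs.*asterisk')"   # FUSE client state dump,
#     written under /var/run/gluster (or the statedump-path volume option)
```

Comparing the client statedump's pending call frames against the kernel stack (stuck in `fuse_flush`) narrows down which translator is holding the request.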
Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of files
One possible reason:

cluster.lookup-optimize: on
cluster.readdir-optimize: on

I've disabled both optimizations, and at least as of now rsync still does
its job with no issues. I would like to find out which option causes such
behavior and why. Will test more.

On Friday, 15 January 2016, 16:09:51 EET Oleksandr Natalenko wrote:
> Another observation: if rsyncing is resumed after a hang, rsync itself
> hangs a lot faster, because it stats the already-copied files. So the
> reason may be not the writing itself, but the massive stat load on the
> GlusterFS volume as well.
>
> On 15.01.2016 09:40, Oleksandr Natalenko wrote:
> > While rsyncing millions of files from an ordinary partition to a
> > GlusterFS volume, just after approximately the first 2 million files
> > rsync hangs, and the following info appears in dmesg:
> >
> > ===
> > [17075038.924481] INFO: task rsync:10310 blocked for more than 120 seconds.
> > [17075038.931948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [17075038.940748] rsync D 88207fc13680 0 10310 10309 0x0080
> > [17075038.940752] 8809c578be18 0086 8809c578bfd8 00013680
> > [17075038.940756] 8809c578bfd8 00013680 880310cbe660 881159d16a30
> > [17075038.940759] 881e3aa25800 8809c578be48 881159d16b10 88087d553980
> > [17075038.940762] Call Trace:
> > [17075038.940770] [] schedule+0x29/0x70
> > [17075038.940797] [] __fuse_request_send+0x13d/0x2c0 [fuse]
> > [17075038.940801] [] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
> > [17075038.940805] [] ? wake_up_bit+0x30/0x30
> > [17075038.940809] [] fuse_request_send+0x12/0x20 [fuse]
> > [17075038.940813] [] fuse_flush+0xff/0x150 [fuse]
> > [17075038.940817] [] filp_close+0x34/0x80
> > [17075038.940821] [] __close_fd+0x78/0xa0
> > [17075038.940824] [] SyS_close+0x23/0x50
> > [17075038.940828] [] system_call_fastpath+0x16/0x1b
> > ===
> >
> > rsync blocks in D state, and to kill it I have to do umount --lazy on
> > the GlusterFS mountpoint and then kill the corresponding client
> > glusterfs process. Then rsync exits.
> >
> > Here is the GlusterFS volume info:
> >
> > ===
> > Volume Name: asterisk_records
> > Type: Distributed-Replicate
> > Volume ID: dc1fe561-fa3a-4f2e-8330-ec7e52c75ba4
> > Status: Started
> > Number of Bricks: 3 x 2 = 6
> > Transport-type: tcp
> > Bricks:
> > Brick1: server1:/bricks/10_megaraid_0_3_9_x_0_4_3_hdd_r1_nolvm_hdd_storage_01/asterisk/records
> > Brick2: server2:/bricks/10_megaraid_8_5_14_x_8_6_16_hdd_r1_nolvm_hdd_storage_01/asterisk/records
> > Brick3: server1:/bricks/11_megaraid_0_5_4_x_0_6_5_hdd_r1_nolvm_hdd_storage_02/asterisk/records
> > Brick4: server2:/bricks/11_megaraid_8_7_15_x_8_8_20_hdd_r1_nolvm_hdd_storage_02/asterisk/records
> > Brick5: server1:/bricks/12_megaraid_0_7_6_x_0_13_14_hdd_r1_nolvm_hdd_storage_03/asterisk/records
> > Brick6: server2:/bricks/12_megaraid_8_9_19_x_8_13_24_hdd_r1_nolvm_hdd_storage_03/asterisk/records
> > Options Reconfigured:
> > cluster.lookup-optimize: on
> > cluster.readdir-optimize: on
> > client.event-threads: 2
> > network.inode-lru-limit: 4096
> > server.event-threads: 4
> > performance.client-io-threads: on
> > storage.linux-aio: on
> > performance.write-behind-window-size: 4194304
> > performance.stat-prefetch: on
> > performance.quick-read: on
> > performance.read-ahead: on
> > performance.flush-behind: on
> > performance.write-behind: on
> > performance.io-thread-count: 2
> > performance.cache-max-file-size: 1048576
> > performance.cache-size: 33554432
> > features.cache-invalidation: on
> > performance.readdir-ahead: on
> > ===
> >
> > The issue reproduces each time I rsync such an amount of files.
> >
> > How could I debug this issue better?
Re: [Gluster-devel] [Gluster-infra] NetBSD regression fixes
On Sat, Jan 16, 2016 at 06:55:49PM +0100, Emmanuel Dreyfus wrote:
> Hello all
>
> Here are the problems identified in NetBSD regression so far:
>
> 1) Before starting regression, the slave complains about "vnconfig:
> VNDIOCGET: Bad file descriptor" and fails the run.
>
> This will be fixed by these changes:
> http://review.gluster.org/13204
> http://review.gluster.org/13205
>
> 2) Spurious failures
> I added a retry-failed-test-once feature so that we get fewer regression
> failures because of spurious failures. It is not used right now because
> it does not play nicely with the bad tests blacklist.
>
> This will be fixed by these changes:
> http://review.gluster.org/13245
> http://review.gluster.org/13247
>
> I have been looping failure-free regression for a while with that trick.

Nice, thanks for these improvements!

> 3) Stale state from previous regression
> We sometimes have processes stuck from a previous regression, awaiting
> vnode locks for destroyed NFS filesystems. This causes the cleanup
> scripts to hang before starting regression and we get a timeout.
>
> I modified the slave's /opt/qa/regression.sh to check for stuck processes
> and reboot the system if we find them. That will fail the current
> regression run, but at least the next ones coming after the reboot will
> be safe.
>
> This fix is not deployed yet; I am waiting for the fixes from point 2 to
> be merged.

Could you send a pull request for the regression.sh script on
https://github.com/gluster/glusterfs-patch-acceptance-tests/ ? Or, if
you don't use GitHub, send the patch by email and we'll take care of
pushing it for you.

> 4) Jenkins starts concurrent runs on the same slave
> We observed that Jenkins sometimes runs two jobs on the same slave at
> once, which of course can only lead to horrible failure.
>
> I modified the slave's /opt/qa/regression.sh to add a lock file so that
> this situation is detected early and reported. The second regression will
> fail, but the idea is to get a better understanding of how that can
> occur.
>
> This fix is not deployed yet; I am waiting for the fixes from point 2 to
> be merged.

Hmm, I have not seen that before, but it surely is something to be
concerned about :-/

Thanks,
Niels
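The lock-file guard from point 4 is not shown in the thread; a portable sketch of the idea uses `mkdir`, which is atomic and works even where flock(1) is unavailable (as on a stock NetBSD slave). The lock path and messages here are hypothetical:

```shell
#!/bin/sh
# Sketch: fail fast if another regression run already holds the lock.
# mkdir(2) is atomic, so it serves as a lock without needing flock(1).
LOCKDIR="${TMPDIR:-/tmp}/netbsd-regression.lock"

if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "Another regression run is already in progress (lock: $LOCKDIR)" >&2
    echo "Held by pid: $(cat "$LOCKDIR/pid" 2>/dev/null)" >&2
    exit 1
fi
echo "$$" > "$LOCKDIR/pid"
trap 'rm -rf "$LOCKDIR"' EXIT INT TERM

echo "Lock acquired; running regression as pid $$"
# ... the actual regression run would go here ...
```

Recording the holder's pid in the lock directory is what makes the "detected early and reported" part useful: the second job's console log shows exactly which run it collided with.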
[Gluster-devel] NetBSD regression fixes
Hello all

Here are the problems identified in NetBSD regression so far:

1) Before starting regression, the slave complains about "vnconfig:
VNDIOCGET: Bad file descriptor" and fails the run.

This will be fixed by these changes:
http://review.gluster.org/13204
http://review.gluster.org/13205

2) Spurious failures
I added a retry-failed-test-once feature so that we get fewer regression
failures because of spurious failures. It is not used right now because
it does not play nicely with the bad tests blacklist.

This will be fixed by these changes:
http://review.gluster.org/13245
http://review.gluster.org/13247

I have been looping failure-free regression for a while with that trick.

3) Stale state from previous regression
We sometimes have processes stuck from a previous regression, awaiting
vnode locks for destroyed NFS filesystems. This causes the cleanup
scripts to hang before starting regression and we get a timeout.

I modified the slave's /opt/qa/regression.sh to check for stuck processes
and reboot the system if we find them. That will fail the current
regression run, but at least the next ones coming after the reboot will
be safe.

This fix is not deployed yet; I am waiting for the fixes from point 2 to
be merged.

4) Jenkins starts concurrent runs on the same slave
We observed that Jenkins sometimes runs two jobs on the same slave at
once, which of course can only lead to horrible failure.

I modified the slave's /opt/qa/regression.sh to add a lock file so that
this situation is detected early and reported. The second regression will
fail, but the idea is to get a better understanding of how that can
occur.

This fix is not deployed yet; I am waiting for the fixes from point 2 to
be merged.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
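The stuck-process check from point 3 is likewise not shown in the thread. A rough sketch of the gate it describes follows; it is hypothetical, and the actual reboot is left as a comment so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Sketch: before starting regression, look for processes left in
# uninterruptible (D) wait by a previous run, e.g. stuck on vnode locks
# for destroyed NFS filesystems, and reboot if any are found.
stuck=$(ps axo pid=,stat=,comm= | awk '$2 ~ /^D/ {print $1}')

if [ -n "$stuck" ]; then
    echo "Stuck processes from a previous run:" $stuck
    echo "Would reboot to clear stale state"
    # shutdown -r now   # the real script reboots; echo keeps this sketch safe
else
    echo "No stuck processes; safe to start regression"
fi
```

This matches the trade-off described above: the run that hits stale state is sacrificed, but the reboot guarantees the following runs start from a clean slave.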