> > Do you kill all gluster processes (not just glusterd but even the brick
> > processes) before issuing reboot? This is necessary to prevent I/O stalls.
> > There is stop-all-gluster-processes.sh which should be available as a part
> > of the gluster installation (maybe in /usr/share/glusterfs/scripts/) which
> > you can use. Can you check if this helps?
>
> A reboot shuts down gracefully, so those processes are shut down before the
> reboot begins.

We've moved on to discussing this matter in the gluster slack; there's a lot
more info there now about the above. The gist is that heavy xfs fragmentation
when bricks are almost full (95-96%) made healing as well as disk accesses a
lot more expensive, slow, and prone to hanging. What's still not clear is why
a slowdown of one brick/gluster instance similarly affects all bricks/gluster
instances on other servers, and how that can be optimized/mitigated.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>
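For anyone who wants to gauge how bad the fragmentation actually is on a
nearly-full brick, XFS can report it directly. This is only a sketch, not
something from the thread: the device and mount point are taken from the
xfs_info output quoted further down, and the file name is a placeholder.

# Overall file fragmentation factor of the brick filesystem (read-only query)
xfs_db -r -c frag /dev/sdc

# Free-space histogram: many tiny free extents at 95-96% full means new
# writes and heals get scattered, making every I/O more expensive
xfs_db -r -c freesp /dev/sdc

# Extent map of one suspiciously slow file (placeholder path)
xfs_bmap -v /mnt/citadel_block1/SNIP_data1/example-large-file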
On Thu, Jul 30, 2020 at 8:21 PM Ravishankar N <[email protected]> wrote:

> On 25/07/20 4:35 am, Artem Russakovskii wrote:
>
> Speaking of fio, could the gluster team please help me understand
> something?
>
> We've been having lots of performance issues related to gluster using
> attached block storage on Linode. At some point, I figured out that Linode
> has a cap of 500 IOPS on their block storage
> <https://www.linode.com/community/questions/19437/does-a-dedicated-cpu-or-high-memory-plan-improve-disk-io-performance#answer-72142>
> (with spikes to 1500 IOPS). The block storage we use is formatted xfs with
> 4KB bsize (block size).
>
> I then ran a bunch of fio tests on the block storage itself (not the
> gluster fuse mount), which performed horribly when the bs parameter was
> set to 4k:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test --filename=test --bs=4k --iodepth=64 --size=4G
> --readwrite=randwrite --ramp_time=4
>
> During these tests, fio ETA crawled to over an hour, at some point dropped
> to 45min, and I did see 500-1500 IOPS flash by briefly, then it went back
> down to 0. I/O seems majorly choked for some reason, likely because gluster
> is using some of it. Transfer speed with such a 4k block size is 2 MB/s,
> with spikes to 6 MB/s. This causes the load on the server to spike up to
> 100+ and brings down all our servers.
>
> Jobs: 1 (f=1): [w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477 IOPS][eta 43m:00s]
> Jobs: 1 (f=1): [w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 44m:54s]
>
> xfs_info /mnt/citadel_block1
> meta-data=/dev/sdc               isize=512    agcount=103, agsize=26214400 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2684354560, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=51200, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> When I increase the --bs param to fio from 4k to, say, 64k, transfer speed
> goes up significantly and is more like 50 MB/s, and at 256k, it's 200 MB/s.
>
> So what I'm trying to understand is:
>
> 1. How does the xfs block size (4KB) relate to the block size in fio
> tests? If we're limited by IOPS, and the xfs block size is 4KB, how can fio
> produce better results with a varying --bs param?
> 2. Would increasing the xfs data block size to something like 64-256KB
> help with our issue of choking I/O and skyrocketing load?
>
> I have experienced similar behavior when running fio tests with bs=4k on a
> gluster volume backed by XFS under high load (numjobs=32). When I observed
> the strace of the brick processes (strace -f -T -p $PID), I saw fsync
> system calls taking around 2500 seconds, which is insane.
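A note on question 1 for anyone reading the archive: under a hard IOPS cap,
throughput is roughly IOPS multiplied by the request size fio issues via
--bs, and that request size is independent of the filesystem's 4 KiB bsize,
which is only the allocation unit. A rough sketch of the arithmetic and a
comparison run, reusing the flags from the command quoted above (the loop
and test file names are illustrative, not from the thread):

# Back-of-the-envelope expectation with a ~500 IOPS cap:
#   500 IOPS *   4 KiB ~=   2 MB/s   (matches the 4k run above)
#   500 IOPS *  64 KiB ~=  32 MB/s
#   500 IOPS * 256 KiB ~= 128 MB/s
for bs in 4k 64k 256k; do
  fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
      --name=test-$bs --filename=test-$bs --bs=$bs --iodepth=64 \
      --size=4G --readwrite=randwrite --ramp_time=4
done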
> I'm not sure if this is specific to the way fio does its I/O pattern and
> the way XFS handles it. When I used 64k block sizes, the fio tests
> completed just fine.
>
> 1. The worst hangs and load spikes happen when we reboot one of the
> gluster servers, but not when it's down - when it comes back online. Even
> with gluster not showing anything pending heal, my guess is it's still
> trying to do lots of I/O between the 4 nodes for some reason, but I don't
> understand why.
>
> Do you kill all gluster processes (not just glusterd but even the brick
> processes) before issuing reboot? This is necessary to prevent I/O stalls.
> There is stop-all-gluster-processes.sh which should be available as a part
> of the gluster installation (maybe in /usr/share/glusterfs/scripts/) which
> you can use. Can you check if this helps?
>
> Regards,
> Ravi
>
> I've been banging my head on the wall with this problem for months.
> Appreciate any feedback here.
>
> Thank you.
>
> gluster volume info below
>
> Volume Name: SNIP_data1
> Type: Replicate
> Volume ID: SNIP
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
> Brick2: forge:/mnt/SNIP_block1/SNIP_data1
> Brick3: hive:/mnt/SNIP_block1/SNIP_data1
> Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
> Options Reconfigured:
> cluster.quorum-count: 1
> cluster.quorum-type: fixed
> network.ping-timeout: 5
> network.remote-dio: enable
> performance.rda-cache-limit: 256MB
> performance.readdir-ahead: on
> performance.parallel-readdir: on
> network.inode-lru-limit: 500000
> performance.md-cache-timeout: 600
> performance.cache-invalidation: on
> performance.stat-prefetch: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> cluster.readdir-optimize: on
> performance.io-thread-count: 32
> server.event-threads: 4
> client.event-threads: 4
> performance.read-ahead: off
> cluster.lookup-optimize: on
> performance.cache-size: 1GB
> cluster.self-heal-daemon: enable
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
> cluster.granular-entry-heal: enable
> cluster.data-self-heal-algorithm: full
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>
> On Thu, Jul 23, 2020 at 12:08 AM Qing Wang <[email protected]> wrote:
>
>> Hi,
>>
>> I have one more question about the Gluster linear scale-out performance,
>> regarding the "write-behind off" case specifically -- when "write-behind"
>> is off, with the stripe volumes and other settings still as in the earlier
>> posts in this thread, the storage I/O seems not to relate to the number of
>> storage nodes. In my experiment, no matter whether I have 2 brick server
>> nodes or 8 brick server nodes, the aggregated gluster I/O performance is
>> ~100 MB/sec, and fio benchmark measurements give the same result. If
>> "write-behind" is on, then the storage performance scales out linearly as
>> the number of brick server nodes increases.
>>
>> Whether the write-behind option is on or off, I thought the gluster I/O
>> performance should be pooled and aggregated together as a whole. If that
>> is the case, why do I get a consistent gluster performance (~100 MB/sec)
>> when "write-behind" is off? Please advise me if I misunderstood something.
>>
>> Thanks,
>> Qing
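For anyone reproducing Qing's on/off comparison: write-behind is a per-volume
option, so both cases can be measured back to back. The sketch below is only
illustrative -- the volume name and mount point are placeholders, and the fio
job is just one reasonable way to drive aggregate writes:

VOL=testvol            # placeholder volume name
MNT=/mnt/glusterfs     # placeholder client mount point

gluster volume get $VOL performance.write-behind      # check current setting

for wb in off on; do
  gluster volume set $VOL performance.write-behind $wb
  sync; echo 3 > /proc/sys/vm/drop_caches              # keep runs comparable
  fio --name=wb-$wb --directory=$MNT --ioengine=libaio --direct=1 \
      --rw=write --bs=1M --size=4G --numjobs=4 --group_reporting
done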
>>
>> On Tue, Jul 21, 2020 at 7:29 PM Qing Wang <[email protected]> wrote:
>>
>>> fio gives me the correct linear scale-out results, and you're right, the
>>> storage cache is the root cause that makes the dd measurement results
>>> inaccurate.
>>>
>>> Thanks,
>>> Qing
>>>
>>> On Tue, Jul 21, 2020 at 2:53 PM Yaniv Kaul <[email protected]> wrote:
>>>
>>>> On Tue, 21 Jul 2020, 21:43 Qing Wang <[email protected]> wrote:
>>>>
>>>>> Hi Yaniv,
>>>>>
>>>>> Thanks for the quick response. I forgot to mention that I am testing
>>>>> write performance, not reads. In this case, would the client cache hit
>>>>> rate still be a big issue?
>>>>
>>>> It's not hitting the storage directly. Since it's also single threaded,
>>>> it may also not saturate it. I highly recommend testing properly.
>>>> Y.
>>>>
>>>>> I'll use fio to run my test once again, thanks for the suggestion.
>>>>>
>>>>> Thanks,
>>>>> Qing
>>>>>
>>>>> On Tue, Jul 21, 2020 at 2:38 PM Yaniv Kaul <[email protected]> wrote:
>>>>>
>>>>>> On Tue, 21 Jul 2020, 21:30 Qing Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to test Gluster linear scale-out performance by adding
>>>>>>> more storage servers/bricks and measuring the storage I/O
>>>>>>> performance. To vary the number of storage servers, I create several
>>>>>>> "stripe" volumes that contain 2 brick servers, 3 brick servers, 4
>>>>>>> brick servers, and so on. On the gluster client side, I used "dd
>>>>>>> if=/dev/zero of=/mnt/glusterfs/dns_test_data_26g bs=1M count=26000"
>>>>>>> to create 26G of data (or a larger size), and that data is
>>>>>>> distributed to the corresponding gluster servers (each has a gluster
>>>>>>> brick on it), with "dd" reporting the final I/O throughput. The
>>>>>>> network is 40G InfiniBand, although I didn't do any specific
>>>>>>> configuration to use advanced features.
>>>>>>
>>>>>> Your dd command is inaccurate, as it'll hit the client cache. It is
>>>>>> also single threaded. I suggest switching to fio.
>>>>>> Y.
>>>>>>
>>>>>>> What confuses me is that the storage I/O seems not to relate to the
>>>>>>> number of storage nodes, although the Gluster documentation says it
>>>>>>> should scale linearly. For example, when "write-behind" is on and the
>>>>>>> InfiniBand "jumbo frame" (connected mode) is on, I can get ~800
>>>>>>> MB/sec reported by "dd", no matter whether I have 2 brick servers or
>>>>>>> 8 brick servers -- for the 2-server case, each server gets ~400
>>>>>>> MB/sec; for the 4-server case, each server gets ~200 MB/sec. That is,
>>>>>>> each server's I/O does aggregate into the final storage I/O (800
>>>>>>> MB/sec), but this is not "linear scale-out".
>>>>>>>
>>>>>>> Can somebody help me to understand why this is the case? I certainly
>>>>>>> may have some misunderstanding/misconfiguration here. Please correct
>>>>>>> me if I do, thanks!
>>>>>>>
>>>>>>> Best,
>>>>>>> Qing
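A footnote on the dd measurements discussed above: even before switching to
fio, dd can be kept from measuring the client page cache with oflag=direct,
or at least conv=fdatasync, which forces a flush before the throughput is
reported. It remains single threaded, which is why fio with numjobs is still
the better tool; the commands below are just illustrative variants of the dd
invocation quoted above.

# Bypass the client page cache so dd reports storage throughput, not cache speed
dd if=/dev/zero of=/mnt/glusterfs/dns_test_data_26g bs=1M count=26000 oflag=direct

# Or, at minimum, flush before dd prints its throughput figure
dd if=/dev/zero of=/mnt/glusterfs/dns_test_data_26g bs=1M count=26000 conv=fdatasync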
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
