On 25/07/20 4:35 am, Artem Russakovskii wrote:
Speaking of fio, could the gluster team please help me understand something?

We've been having lots of performance issues related to gluster using attached block storage on Linode. At some point, I figured out that Linode has a cap of 500 IOPS on their block storage <https://www.linode.com/community/questions/19437/does-a-dedicated-cpu-or-high-memory-plan-improve-disk-io-performance#answer-72142> (with spikes to 1500 IOPS). The block storage we use is formatted as xfs with a 4KB bsize (block size).

I then ran a bunch of fio tests on the block storage itself (not the gluster fuse mount), which performed horribly when the bs parameter was set to 4k:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4
During these tests, the fio ETA crawled to over an hour, at some point dropped to 45min, and I did see 500-1500 IOPS flash by briefly, then it went back down to 0. I/O seems majorly choked for some reason, likely because gluster is using some of it. Transfer speed with such a 4k block size is 2 MB/s, with spikes to 6 MB/s. This causes the load on the server to spike up to 100+ and brings down all our servers.

Jobs: 1 (f=1): [w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477 IOPS][eta 43m:00s]
Jobs: 1 (f=1): [w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 44m:54s]

xfs_info /mnt/citadel_block1
meta-data=/dev/sdc               isize=512    agcount=103, agsize=26214400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2684354560, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

When I increase the --bs param to fio from 4k to, say, 64k, transfer speed goes up significantly and is more like 50MB/s; at 256k, it's 200MB/s.
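For what it's worth, a back-of-the-envelope check (my own arithmetic, assuming the 500 IOPS cap above is the binding limit) lines up with these numbers, since throughput tops out around IOPS x block size:

500 IOPS x 4 KB   = ~2 MB/s     (matches the bs=4k result)
500 IOPS x 64 KB  = ~31 MB/s    (bursts above 500 IOPS would explain the ~50 MB/s at bs=64k)
500 IOPS x 256 KB = ~125 MB/s   (likewise for the ~200 MB/s at bs=256k)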

So what I'm trying to understand is:

 1. How does the xfs block size (4KB) relate to the block size in fio
    tests? If we're limited by IOPS, and the xfs block size is 4KB,
    how can fio produce better results as the --bs param varies? (A
    sketch of the sweep I ran is just below this list.)
 2. Would increasing the xfs data block size to something like
    64-256KB help with our issue of choked IO and skyrocketing load?
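For question 1, a sweep like the one I've been running makes the IOPS-vs-bandwidth tradeoff visible (a sketch; the target file on the same xfs mount is just an example):

for bs in 4k 16k 64k 256k; do
  fio --name=bs_sweep_$bs --ioengine=libaio --direct=1 --rw=randwrite \
      --bs=$bs --iodepth=64 --size=1G --runtime=30 --time_based \
      --filename=/mnt/citadel_block1/fio_test
done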

I have experienced similar behavior when running fio tests with bs=4k on a gluster volume backed by XFS under high load (numjobs=32). When I looked at an strace of the brick processes (strace -f -T -p $PID), I saw fsync system calls taking around 2500 seconds, which is insane. I'm not sure if this is specific to the I/O pattern fio generates and the way XFS handles it. When I used 64k block sizes, the fio tests completed just fine.
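If you want to reproduce this, something along these lines should work (a rough sketch; get the brick PID from "gluster volume status"):

# attach to a brick process (glusterfsd) and print time spent in each fsync call
strace -f -T -e trace=fsync -p <BRICK_PID>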

 3. The worst hangs and load spikes happen when we reboot one of the
    gluster servers - not while it's down, but when it comes back
    online. Even with gluster not showing anything pending heal, my
    guess is that it's still trying to do lots of IO between the 4
    nodes for some reason, but I don't understand why.

Do you kill all gluster processes (not just glusterd, but the brick processes too) before issuing the reboot? This is necessary to prevent I/O stalls. There is a stop-all-gluster-processes.sh script which should be available as part of the gluster installation (maybe in /usr/share/glusterfs/scripts/) that you can use. Can you check if this helps?
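Roughly, the sequence I have in mind would be (a sketch; the script path may vary by distro and gluster version):

# stop glusterd, brick processes and other gluster daemons, then reboot
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
reboot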

Regards,

Ravi

I've been banging my head on the wall with this problem for months. Appreciate any feedback here.

Thank you.

gluster volume info below:

Volume Name: SNIP_data1
Type: Replicate
Volume ID: SNIP
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
Brick2: forge:/mnt/SNIP_block1/SNIP_data1
Brick3: hive:/mnt/SNIP_block1/SNIP_data1
Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
network.ping-timeout: 5
network.remote-dio: enable
performance.rda-cache-limit: 256MB
performance.readdir-ahead: on
performance.parallel-readdir: on
network.inode-lru-limit: 500000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.readdir-optimize: on
performance.io-thread-count: 32
server.event-threads: 4
client.event-threads: 4
performance.read-ahead: off
cluster.lookup-optimize: on
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net <http://beerpla.net/> | @ArtemR <http://twitter.com/ArtemR>


On Thu, Jul 23, 2020 at 12:08 AM Qing Wang <[email protected]> wrote:

    Hi,

    I have one more question about Gluster linear scale-out
    performance, specifically regarding the "write-behind off" case.
    When "write-behind" is off, with the same stripe volumes and
    other settings as posted earlier in the thread, the storage I/O
    does not seem to scale with the number of storage nodes. In my
    experiment, no matter whether I have 2 or 8 brick server nodes,
    the aggregated gluster I/O performance is ~100MB/sec, and fio
    benchmark measurements give the same result. If "write-behind"
    is on, then storage performance scales out linearly as the
    number of brick server nodes increases.

    Whether the write-behind option is on or off, I thought the
    gluster I/O performance should be pooled and aggregated together
    as a whole. If that is the case, why do I get the same gluster
    performance (~100MB/sec) when "write-behind" is off? Please
    advise me if I misunderstood something.
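    For reference, I toggle write-behind per volume like this (the
    volume name is just an example):

    gluster volume set test_vol performance.write-behind off
    gluster volume set test_vol performance.write-behind on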

    Thanks,
    Qing




    On Tue, Jul 21, 2020 at 7:29 PM Qing Wang <[email protected]> wrote:

        fio gives me the correct linear scale-out results, and you're
        right, the storage cache is the root cause of the inaccurate
        dd measurements.

        Thanks,
        Qing


        On Tue, Jul 21, 2020 at 2:53 PM Yaniv Kaul <[email protected]> wrote:



            On Tue, 21 Jul 2020, 21:43 Qing Wang <[email protected]> wrote:

                Hi Yaniv,

                Thanks for the quick response. I forgot to mention
                that I am testing write performance, not read. In
                that case, would the client cache hit rate still be
                a big issue?


            It's not hitting the storage directly. Since it's also
            single threaded, it may also not saturate it. I highly
            recommend testing properly.
            Y.


                I'll use fio to run my test once again, thanks for the
                suggestion.

                Thanks,
                Qing

                On Tue, Jul 21, 2020 at 2:38 PM Yaniv Kaul <[email protected]> wrote:



                    On Tue, 21 Jul 2020, 21:30 Qing Wang <[email protected]> wrote:

                        Hi,

                        I am trying to test Gluster linear scale-out
                        performance by adding more storage
                        servers/bricks and measuring the storage I/O
                        performance. To vary the number of storage
                        servers, I create several "stripe" volumes
                        that contain 2 brick servers, 3 brick
                        servers, 4 brick servers, and so on. On the
                        gluster client side, I used "dd if=/dev/zero
                        of=/mnt/glusterfs/dns_test_data_26g bs=1M
                        count=26000" to create 26G of data (or a
                        larger size); that data is distributed to the
                        corresponding gluster servers (each has a
                        gluster brick on it), and "dd" reports the
                        final I/O throughput. The interconnect is 40G
                        InfiniBand, although I didn't do any specific
                        configuration to use advanced features.


                    Your dd command is inaccurate, as it'll hit the
                    client cache. It is also single threaded. I
                    suggest switching to fio.
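                    Something along these lines would be a starting
                    point - a sketch only; the path, size, and job
                    count are examples to adapt:

                    fio --name=seqwrite --ioengine=libaio --direct=1 \
                        --rw=write --bs=1M --size=4G --numjobs=4 \
                        --group_reporting --directory=/mnt/glusterfs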
                    Y.


                        What confuses me is that the storage I/O does
                        not seem to scale with the number of storage
                        nodes, even though the Gluster documentation
                        says it should scale linearly. For example,
                        when "write-behind" is on and the InfiniBand
                        "jumbo frame" (connected mode) is on, I get
                        ~800 MB/sec reported by "dd" no matter
                        whether I have 2 brick servers or 8: in the
                        2-server case, each server does ~400 MB/sec;
                        in the 4-server case, each server does ~200
                        MB/sec. That is, the per-server I/O does
                        aggregate into the final storage I/O (800
                        MB/sec), but this is not "linear scale-out".

                        Can somebody help me understand why this is
                        the case? I may well have some
                        misunderstanding or misconfiguration here.
                        Please correct me if I do, thanks!

                        Best,
                        Qing



________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
