On 04/18/2018 10:14 AM, Artem Russakovskii wrote:
Following up here on a related issue that is very serious for us.

I took down one of the 4 replicated gluster servers for maintenance today. There are 2 gluster volumes totaling about 600GB, which is not that much data. After the server came back online, it started auto-healing, and pretty much all operations on gluster froze for many minutes.

For example, I was trying to run an ls -alrt in a folder with 7300 files, and it took a good 15-20 minutes before returning.

During this time, iostat showed 100% utilization on the brick, heal status took many minutes to return, and glusterfsd used up tons of CPU (I saw it spike to 600%). Gluster already has massive performance issues for me, but healing after a 4-hour downtime is on another level of bad performance.

For example, this command took many minutes to run:

gluster volume heal androidpolice_data3 info summary
Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
Status: Connected
Total Number of entries: 91
Number of entries in heal pending: 90
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick forge:/mnt/forge_block4/androidpolice_data3
Status: Connected
Total Number of entries: 87
Number of entries in heal pending: 86
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick hive:/mnt/hive_block4/androidpolice_data3
Status: Connected
Total Number of entries: 87
Number of entries in heal pending: 86
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick citadel:/mnt/citadel_block4/androidpolice_data3
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0
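
For anyone who wants to track the backlog over time rather than eyeball it, here is a minimal Python sketch that totals the pending entries per brick from the summary text above. The function names (parse_heal_summary, pending_by_brick) are made up for illustration, and the parser only assumes the "Brick ..." / "key: value" layout shown in this output; it is not an official tool.

```python
# Sample text copied from the `gluster volume heal ... info summary` output above.
SAMPLE = """\
Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
Status: Connected
Total Number of entries: 91
Number of entries in heal pending: 90
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick forge:/mnt/forge_block4/androidpolice_data3
Status: Connected
Total Number of entries: 87
Number of entries in heal pending: 86
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick hive:/mnt/hive_block4/androidpolice_data3
Status: Connected
Total Number of entries: 87
Number of entries in heal pending: 86
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick citadel:/mnt/citadel_block4/androidpolice_data3
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0
"""


def parse_heal_summary(text):
    """Return {brick: {field: value}} for each 'Brick ...' section."""
    bricks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            # Counters become ints; anything else (e.g. Status) stays a string.
            bricks[current][key.strip()] = int(value) if value.isdigit() else value
    return bricks


def pending_by_brick(bricks):
    """Map each brick to its 'heal pending' counter (0 if missing)."""
    return {b: f.get("Number of entries in heal pending", 0) for b, f in bricks.items()}


if __name__ == "__main__":
    pending = pending_by_brick(parse_heal_summary(SAMPLE))
    for brick, count in pending.items():
        print(f"{count:4d}  {brick}")
    print("total pending:", sum(pending.values()))
```

In practice the text would come from the command itself instead of the SAMPLE string.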


Statistics showed a diminishing number of failed heals:
...
Ending time of crawl: Tue Apr 17 21:13:08 2018

Type of crawl: INDEX
No. of entries healed: 2
No. of entries in split-brain: 0
No. of heal failed entries: 102

Starting time of crawl: Tue Apr 17 21:13:09 2018

Ending time of crawl: Tue Apr 17 21:14:30 2018

Type of crawl: INDEX
No. of entries healed: 4
No. of entries in split-brain: 0
No. of heal failed entries: 91

Starting time of crawl: Tue Apr 17 21:14:31 2018

Ending time of crawl: Tue Apr 17 21:15:34 2018

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 88
...
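
To make the trend in the excerpt above explicit, here is a small Python sketch that pulls out the failed-heal counts and the duration of each crawl. It is only an illustration: the helper names are hypothetical, the parser assumes the exact labels shown in this output, and the first crawl's start time is elided in the excerpt, so it gets no duration.

```python
import re
from datetime import datetime

# Excerpt copied from the statistics output above (elided parts left out).
SAMPLE = """\
Ending time of crawl: Tue Apr 17 21:13:08 2018

Type of crawl: INDEX
No. of entries healed: 2
No. of entries in split-brain: 0
No. of heal failed entries: 102

Starting time of crawl: Tue Apr 17 21:13:09 2018

Ending time of crawl: Tue Apr 17 21:14:30 2018

Type of crawl: INDEX
No. of entries healed: 4
No. of entries in split-brain: 0
No. of heal failed entries: 91

Starting time of crawl: Tue Apr 17 21:14:31 2018

Ending time of crawl: Tue Apr 17 21:15:34 2018

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 88
"""

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def failed_heal_counts(text):
    """Return the 'heal failed entries' counter of each crawl, in order."""
    pattern = r"^No\. of heal failed entries:\s*(\d+)\s*$"
    return [int(n) for n in re.findall(pattern, text, flags=re.MULTILINE)]


def crawl_durations(text):
    """Return the length in seconds of every crawl that has both timestamps."""
    durations = []
    start = None
    for line in text.splitlines():
        label, _, stamp = line.partition(": ")
        if label == "Starting time of crawl":
            start = datetime.strptime(stamp.strip(), TIME_FORMAT)
        elif label == "Ending time of crawl" and start is not None:
            end = datetime.strptime(stamp.strip(), TIME_FORMAT)
            durations.append(int((end - start).total_seconds()))
            start = None
    return durations


if __name__ == "__main__":
    print("failed heals per crawl:", failed_heal_counts(SAMPLE))
    print("crawl durations (s):", crawl_durations(SAMPLE))
```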

Eventually, everything heals and goes back to at least where the roof isn't on fire anymore.

The server stats and volume options were given in one of the previous replies to this thread.

Any ideas or things I could run and show the output of to help diagnose? I'm also very open to working with someone on the team on a live debugging session if there's interest.

It is likely that self-heal is causing the CPU spike due to the flood of lookups, locks, and checksum fops that the self-heal daemon (shd) sends to the bricks. There's a script to control shd's CPU usage using cgroups, which should help in regulating self-heal traffic: https://review.gluster.org/#/c/18404/ (see extras/control-cpu-load.sh).

Other self-heal-related volume options that you could change are setting 'cluster.data-self-heal-algorithm' to 'full' and 'granular-entry-heal' to 'enable'. `gluster volume set help` should give you more information about these options.
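
As a rough illustration of the two option changes, the Python sketch below builds the corresponding gluster commands and prints them without running anything by default. The helper names are hypothetical. It also assumes that, on a volume that is already started, granular entry heal is switched on through the `gluster volume heal <VOLNAME> granular-entry-heal enable` form rather than `volume set`; please check that against `gluster volume set help` and the CLI usage on your release before relying on it.

```python
import shlex
import subprocess


def heal_tuning_commands(volume):
    """Build argv lists for the two self-heal option changes (hypothetical helper)."""
    return [
        # 'full' copies the whole file during heal instead of checksumming blocks.
        ["gluster", "volume", "set", volume, "cluster.data-self-heal-algorithm", "full"],
        # Assumption: started volumes take this setting via the heal subcommand.
        ["gluster", "volume", "heal", volume, "granular-entry-heal", "enable"],
    ]


def apply_commands(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Volume name taken from the heal output earlier in this message.
    apply_commands(heal_tuning_commands("androidpolice_data3"))
```

The dry-run default is deliberate, so the commands can be reviewed before anything touches a production volume.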
Thanks,
Ravi


Thank you.


Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC beerpla.net <http://beerpla.net/> | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii> | @ArtemR <http://twitter.com/ArtemR>

On Tue, Apr 10, 2018 at 9:56 AM, Artem Russakovskii wrote:

    Hi Vlad,

    I actually saw that post already and even asked a question 4 days
    ago
    (https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode#comment1172497_540917).
    The accepted answer also seems to go against your suggestion to
    enable direct-io-mode as it says it should be disabled for better
    performance when used just for file accesses.

    It'd be great if someone from the Gluster team chimed in about
    this thread.


    Sincerely,
    Artem


    On Tue, Apr 10, 2018 at 7:01 AM, Vlad Kopylov wrote:

        I wish I knew, or were able to get, a detailed description of
        those options myself.
        Here is a discussion of direct-io-mode:
        https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode
        Same as you, I ran tests on a large volume of files and found
        that the main delays are in attribute calls, so I ended up with
        those mount options to improve performance.
        I discovered those options basically by googling this users
        list, where people share their tests.
        I'm not sure I share your optimism: rather than upgrading, I
        downgraded to 3.12 and have no dir view issue now, though I
        had to recreate the cluster and re-add bricks with existing
        data.

        On Tue, Apr 10, 2018 at 1:47 AM, Artem Russakovskii wrote:

            Hi Vlad,

            I'm using only localhost: mounts.

            Can you please explain what effect each option has on
            the performance issues shown in my posts?
            "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"
            From what I remember, direct-io-mode=enable didn't make a
            difference in my tests, but I suppose I can try again. The
            explanations about direct-io-mode are quite confusing on
            the web in various guides, saying enabling it could make
            performance worse in some situations and better in others
            due to OS file cache.
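
For reference, here is a small Python sketch of how the quoted option string might be folded into an /etc/fstab entry for a localhost mount. The helper name is hypothetical, the volume name and mount point are the ones given elsewhere in this thread, and keeping "defaults,_netdev" in front of the extra options is an assumption, not a recommendation on which options to use.

```python
# Option string exactly as quoted in the message above.
SUGGESTED = (
    "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,"
    "direct-io-mode=enable,fetch-attempts=5"
)


def fstab_line(server, volume, mountpoint, extra_options):
    """Compose a glusterfs fstab entry (hypothetical helper for illustration)."""
    # Assumption: keep the base options and append the suggested ones.
    options = ["defaults", "_netdev"] + [o for o in extra_options.split(",") if o]
    return f"{server}:/{volume} {mountpoint} glusterfs {','.join(options)} 0 0"


if __name__ == "__main__":
    print(fstab_line("localhost", "apkmirror_data1", "/mnt/apkmirror_data1", SUGGESTED))
```

Whether each of these options actually helps is the open question of this thread, so the line is only a template for testing them one at a time.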

            There are also these gluster volume settings, adding to
            the confusion:
            Option: performance.strict-o-direct
            Default Value: off
            Description: This option when set to off, ignores the
            O_DIRECT flag.

            Option: performance.nfs.strict-o-direct
            Default Value: off
            Description: This option when set to off, ignores the
            O_DIRECT flag.

            Re: 4.0. I moved to 4.0 after finding out that it fixes
            the disappearing dirs bug related to
            cluster.readdir-optimize, if you remember
            (http://lists.gluster.org/pipermail/gluster-users/2018-April/033830.html).
            I was already on 3.13 by then, and 4.0 resolved the issue.
            It's been stable for me so far, thankfully.


            Sincerely,
            Artem


            On Mon, Apr 9, 2018 at 10:38 PM, Vlad Kopylov wrote:

                You definitely need to add mount options to /etc/fstab.
                Use the ones from here:
                http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html

                I went on with using local mounts to achieve
                performance as well

                Also, 3.12 or 3.10 branches would be preferable for
                production

                On Fri, Apr 6, 2018 at 4:12 AM, Artem Russakovskii wrote:

                    Hi again,

                    I'd like to expand on the performance issues and
                    plead for help. Here's one case which shows these
                    odd hiccups: https://i.imgur.com/CXBPjTK.gifv.

                    In this GIF where I switch back and forth between
                    copy operations on 2 servers, I'm copying a 10GB
                    dir full of .apk and image files.

                    On server "hive" I'm copying straight from the
                    main disk to an attached volume block (xfs). As
                    you can see, the transfers are relatively speedy
                    and don't hiccup.
                    On server "citadel" I'm copying the same set of
                    data to a 4-replicate gluster which uses block
                    storage as a brick. As you can see, performance is
                    much worse, and there are frequent pauses for many
                    seconds where nothing seems to be happening - just
                    freezes.

                    All 4 servers have the same specs, and all of them
                    have performance issues with gluster and no such
                    issues when raw xfs block storage is used.

                    hive has long finished copying the data, while
                    citadel is barely chugging along and is expected
                    to take probably half an hour to an hour. I have
                    over 1TB of data to migrate, at which point if we
                    went live, I'm not even sure gluster would be able
                    to keep up instead of bringing the machines and
                    services down.



                    Here's the cluster config, though it didn't seem
                    to make any difference performance-wise before I
                    applied the customizations vs after.

                    Volume Name: apkmirror_data1
                    Type: Replicate
                    Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e
                    Status: Started
                    Snapshot Count: 0
                    Number of Bricks: 1 x 4 = 4
                    Transport-type: tcp
                    Bricks:
                    Brick1: nexus2:/mnt/nexus2_block1/apkmirror_data1
                    Brick2: forge:/mnt/forge_block1/apkmirror_data1
                    Brick3: hive:/mnt/hive_block1/apkmirror_data1
                    Brick4: citadel:/mnt/citadel_block1/apkmirror_data1
                    Options Reconfigured:
                    cluster.quorum-count: 1
                    cluster.quorum-type: fixed
                    network.ping-timeout: 5
                    network.remote-dio: enable
                    performance.rda-cache-limit: 256MB
                    performance.readdir-ahead: on
                    performance.parallel-readdir: on
                    network.inode-lru-limit: 500000
                    performance.md-cache-timeout: 600
                    performance.cache-invalidation: on
                    performance.stat-prefetch: on
                    features.cache-invalidation-timeout: 600
                    features.cache-invalidation: on
                    cluster.readdir-optimize: on
                    performance.io-thread-count: 32
                    server.event-threads: 4
                    client.event-threads: 4
                    performance.read-ahead: off
                    cluster.lookup-optimize: on
                    performance.cache-size: 1GB
                    cluster.self-heal-daemon: enable
                    transport.address-family: inet
                    nfs.disable: on
                    performance.client-io-threads: on


                    The mounts are done as follows in /etc/fstab:
                    /dev/disk/by-id/scsi-0Linode_Volume_citadel_block1 /mnt/citadel_block1 xfs defaults 0 2
                    localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev 0 0

                    I'm really not sure if direct-io-mode mount tweaks
                    would do anything here, what the value should be
                    set to, and what it is by default.

                    The OS is OpenSUSE 42.3, 64-bit. 80GB of RAM, 20
                    CPUs, hosted by Linode.

                    I'd really appreciate any help in the matter.

                    Thank you.


                    Sincerely,
                    Artem


                    On Thu, Apr 5, 2018 at 11:13 PM, Artem Russakovskii wrote:

                        Hi,

                        I'm trying to squeeze performance out of
                        gluster on 4 machines, each with 80GB of RAM
                        and 20 CPUs, where Gluster runs on attached
                        block storage (Linode) as 4 replicated bricks.
                        So far, everything I have tried results in
                        sub-optimal performance.

                        There are many files, several million, mostly
                        images. Many operations take minutes. Copying
                        multiple files (even if they're small) suddenly
                        freezes up for seconds at a time and then
                        continues. iostat frequently shows large
                        r_await and w_await values with 100%
                        utilization for the attached block device, etc.

                        But anyway, there are many guides out there
                        for small-file performance improvements, but
                        more explanation is needed, and I think more
                        tweaks should be possible.

                        My question today is about
                        performance.cache-size. Is this the size of a
                        cache in RAM? If so, how do I view the current
                        cache size to see whether it gets full and I
                        should increase it? Is it advisable to bump it
                        up if I have many tens of gigs of RAM free?



                        More generally, in the last 2 months since I
                        first started working with gluster and set a
                        production system live, I've been feeling
                        frustrated because Gluster has a lot of
                        poorly-documented and confusing options. I
                        really wish documentation could be improved
                        with examples and better explanations.

                        Specifically, it'd be absolutely amazing if
                        the docs offered a strategy for setting each
                        value and ways of determining more optimal
                        values. For example, for performance.cache-size, if it said
                        something like "run command abc to see your
                        current cache size, and if it's hurting, up
                        it, but be aware that it's limited by RAM,"
                        it'd be already a huge improvement to the
                        docs. And so on with other options.



                        The gluster team is quite helpful on this
                        mailing list, but in a reactive rather than
                        proactive way. Perhaps it's tunnel vision: once
                        you've worked on a project for so long,
                        less technical explanations and even proper
                        documentation of options take a back seat.
                        Still, I encourage you to be more proactive about
                        helping us understand and optimize Gluster.

                        Thank you.

                        Sincerely,
                        Artem




                    _______________________________________________
                    Gluster-users mailing list
                    http://lists.gluster.org/mailman/listinfo/gluster-users







