Re: [Gluster-users] performance.cache-size for high-RAM clients/servers, other tweaks for performance, and improvements to Gluster docs

Ravishankar N Tue, 17 Apr 2018 23:49:55 -0700


On 04/18/2018 11:59 AM, Artem Russakovskii wrote:

Btw, I've now noticed at least 5 variations in toggling binary option values. Are they all interchangeable, or will using the wrong value not work in some cases?
yes/no
true/false
True/False
on/off
enable/disable
It's quite a confusing/inconsistent practice, especially given that many options will accept any value without erroring out/validation.


All these options are okay.
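
As an illustration only, the five spellings listed above can be treated as equivalent with a small helper. This is a minimal Python sketch: the name `normalize_bool_option` and the choice of "on"/"off" as the canonical form are assumptions made for this example, and it is not Gluster's own validation code.

```python
# Spellings listed earlier in this message, grouped by meaning.
TRUTHY = {"yes", "true", "on", "enable"}
FALSY = {"no", "false", "off", "disable"}


def normalize_bool_option(value: str) -> str:
    """Map any of the listed boolean spellings to 'on' or 'off'.

    Hypothetical helper: the comparison is case-insensitive so that
    'True' and 'true' are treated alike. Anything else raises.
    """
    word = value.strip().lower()
    if word in TRUTHY:
        return "on"
    if word in FALSY:
        return "off"
    raise ValueError(f"not a recognised boolean spelling: {value!r}")


if __name__ == "__main__":
    for spelling in ["yes", "True", "on", "enable", "no", "False", "off", "disable"]:
        print(spelling, "->", normalize_bool_option(spelling))
```

Rejecting unknown spellings outright is a design choice of this sketch, meant to avoid the "accepts any value" confusion raised in the question.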

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net <http://beerpla.net/> | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii> | @ArtemR <http://twitter.com/ArtemR>

On Tue, Apr 17, 2018 at 11:22 PM, Artem Russakovskii <[email protected] <mailto:[email protected]>> wrote:
    Thanks for the link. Looking at the status of that doc, it isn't
    quite ready yet, and there's no mention of the option.

No, this is a completed feature, available since 3.8 IIRC. You can use it safely. There is a difference in how to enable it though. Instead of using 'gluster volume set ...', you need to use 'gluster volume heal <volname> granular-entry-heal enable' to turn it on. If there are no pending heals, it will run successfully. Otherwise you need to wait until heals are over (i.e. heal info shows zero entries). Just follow what the CLI says and you should be fine.


-Ravi
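
The procedure in the reply above (enable only once heal info shows zero entries) could be scripted roughly as follows. This is a hedged Python sketch: the function names and the placeholder volume name `myvol` are made up for illustration, the pending count is assumed to be supplied by the operator, and nothing here actually runs the gluster CLI.

```python
from typing import List


def granular_entry_heal_command(volname: str) -> List[str]:
    """Return the argv for the enable command quoted in the reply above."""
    return ["gluster", "volume", "heal", volname, "granular-entry-heal", "enable"]


def may_enable_now(pending_heal_entries: int) -> bool:
    """The reply says to wait until heal info shows zero entries."""
    if pending_heal_entries < 0:
        raise ValueError("entry count cannot be negative")
    return pending_heal_entries == 0


if __name__ == "__main__":
    volname = "myvol"  # placeholder volume name, not from the thread
    pending = 0        # assumed to have been read from heal info by the operator
    if may_enable_now(pending):
        print(" ".join(granular_entry_heal_command(volname)))
    else:
        print("heals still pending; wait and retry")
```

The sketch only prints the command so that it can be reviewed before being run by hand.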



    Does it mean that whatever is ready now in 4.0.1 is incomplete but
    can be enabled via granular-entry-heal=on, and when it is
    complete, it'll become the default and the flag will simply go away?

    Is there any risk enabling the option now in 4.0.1?


    Sincerely,
    Artem


    On Tue, Apr 17, 2018 at 11:16 PM, Ravishankar N
    <[email protected] <mailto:[email protected]>> wrote:



        On 04/18/2018 10:35 AM, Artem Russakovskii wrote:

        Hi Ravi,

        Could you please expand on how these would help?

        By forcing full here, we move the logic from the CPU to
        network, thus decreasing CPU utilization, is that right?

        Yes, 'diff' employs the rchecksum FOP which does a sha256 
        checksum which can consume CPU. So yes it is sort of shifting
        the load from CPU to the network. But if your average file
        size is small, it would make sense to copy the entire file
        instead of computing checksums.

        This is assuming the CPU and disk utilization are caused by
        the differ and not by lstat and other calls or something.

            Option: cluster.data-self-heal-algorithm
            Default Value: (null)
            Description: Select between "full", "diff". The "full"
            algorithm copies the entire file from source to sink. The
            "diff" algorithm copies to sink only those blocks whose
            checksums don't match with those of source. If no option
            is configured the option is chosen dynamically as
            follows: If the file does not exist on one of the sinks
            or empty file exists or if the source file size is about
            the same as page size the entire file will be read and
            written i.e "full" algo, otherwise "diff" algo is chosen.
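
The dynamic selection rule in the option description above can be read as a small decision function. The Python sketch below is an interpretation under stated assumptions, not Gluster's implementation: the function name is hypothetical, the default page size of 4096 bytes is an assumption, and "about the same as page size" is modelled as "no larger than one page".

```python
from typing import Optional

ASSUMED_PAGE_SIZE = 4096  # assumption: the description does not state the page size


def choose_heal_algorithm(source_size: int,
                          sink_size: Optional[int],
                          page_size: int = ASSUMED_PAGE_SIZE) -> str:
    """Return 'full' or 'diff' following the quoted option description.

    sink_size is None when the file does not exist on the sink.
    ASSUMPTION: "about the same as page size" is modelled as
    source_size <= page_size; the real threshold may differ.
    """
    if sink_size is None:         # file missing on one of the sinks
        return "full"
    if sink_size == 0:            # empty file exists on the sink
        return "full"
    if source_size <= page_size:  # assumed reading of "about the same as page size"
        return "full"
    return "diff"                 # otherwise only mismatching blocks are copied


if __name__ == "__main__":
    print(choose_heal_algorithm(10_000_000, None))       # sink file missing
    print(choose_heal_algorithm(10_000_000, 9_000_000))  # large file present on sink
```

Setting the option explicitly to "full", as suggested elsewhere in this thread, bypasses this selection entirely.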


        I really have no idea what this means and how/why it would
        help. Any more info on this option?


        
        https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.8/granular-entry-self-healing.md
        should help.
        Regards,
        Ravi

            Option: cluster.granular-entry-heal
            Default Value: no
            Description: If this option is enabled, self-heal will
            resort to granular way of recording changelogs and doing
            entry self-heal.


        Thank you.


        Sincerely,
        Artem


        On Tue, Apr 17, 2018 at 9:58 PM, Ravishankar N
        <[email protected] <mailto:[email protected]>> wrote:


            On 04/18/2018 10:14 AM, Artem Russakovskii wrote:

            Following up here on a related issue that is very
            serious for us.

            I took down one of the 4 replicate gluster servers for
            maintenance today. There are 2 gluster volumes totaling
            about 600GB. Not that much data. After the server comes
            back online, it starts auto healing and pretty much all
            operations on gluster freeze for many minutes.

            For example, I was trying to run an ls -alrt in a folder
            with 7300 files, and it took a good 15-20 minutes before
            returning.

            During this time, I can see iostat show 100% utilization
            on the brick, heal status takes many minutes to return,
            glusterfsd uses up tons of CPU (I saw it spike to 600%).
            gluster already has massive performance issues for me,
            but healing after a 4-hour downtime is on another level
            of bad perf.

            For example, this command took many minutes to run:

            gluster volume heal androidpolice_data3 info summary
            Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
            Status: Connected
            Total Number of entries: 91
            Number of entries in heal pending: 90
            Number of entries in split-brain: 0
            Number of entries possibly healing: 1

            Brick forge:/mnt/forge_block4/androidpolice_data3
            Status: Connected
            Total Number of entries: 87
            Number of entries in heal pending: 86
            Number of entries in split-brain: 0
            Number of entries possibly healing: 1

            Brick hive:/mnt/hive_block4/androidpolice_data3
            Status: Connected
            Total Number of entries: 87
            Number of entries in heal pending: 86
            Number of entries in split-brain: 0
            Number of entries possibly healing: 1

            Brick citadel:/mnt/citadel_block4/androidpolice_data3
            Status: Connected
            Total Number of entries: 0
            Number of entries in heal pending: 0
            Number of entries in split-brain: 0
            Number of entries possibly healing: 0
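
Since the summary command itself took minutes to return, it can help to capture its output once and inspect it offline. The following Python sketch parses text in the format shown above into one dictionary per brick; the function names are illustrative, the sample is a two-brick excerpt of the output above, and the parser assumes the format stays exactly as printed here.

```python
from typing import Dict, List

# Two bricks copied from the output above.
SAMPLE = """\
Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
Status: Connected
Total Number of entries: 91
Number of entries in heal pending: 90
Number of entries in split-brain: 0
Number of entries possibly healing: 1

Brick citadel:/mnt/citadel_block4/androidpolice_data3
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0
"""


def parse_heal_summary(text: str) -> List[Dict[str, str]]:
    """Split the summary into one dict per brick, keyed by field label."""
    bricks: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Brick "):
            # A new brick section starts; keep the full host:/path string.
            bricks.append({"Brick": line[len("Brick "):]})
        elif ":" in line and bricks:
            key, _, value = line.partition(":")
            bricks[-1][key.strip()] = value.strip()
    return bricks


def total_pending(bricks: List[Dict[str, str]]) -> int:
    """Sum the 'heal pending' counters across all bricks."""
    return sum(int(b.get("Number of entries in heal pending", "0")) for b in bricks)


if __name__ == "__main__":
    parsed = parse_heal_summary(SAMPLE)
    print(len(parsed), "bricks,", total_pending(parsed), "entries pending")
```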


            Statistics showed a diminishing number of failed heals:
            ...
            Ending time of crawl: Tue Apr 17 21:13:08 2018

            Type of crawl: INDEX
            No. of entries healed: 2
            No. of entries in split-brain: 0
            No. of heal failed entries: 102

            Starting time of crawl: Tue Apr 17 21:13:09 2018

            Ending time of crawl: Tue Apr 17 21:14:30 2018

            Type of crawl: INDEX
            No. of entries healed: 4
            No. of entries in split-brain: 0
            No. of heal failed entries: 91

            Starting time of crawl: Tue Apr 17 21:14:31 2018

            Ending time of crawl: Tue Apr 17 21:15:34 2018

            Type of crawl: INDEX
            No. of entries healed: 0
            No. of entries in split-brain: 0
            No. of heal failed entries: 88
            ...
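
To check the "diminishing number of failed heals" claim without reading each crawl by eye, the counters can be pulled out of the text above. This is a small Python sketch with illustrative function names; the sample keeps only the counter lines from the three crawls shown, and the regular expression assumes the label is printed exactly as above.

```python
import re
from typing import List

# Counter lines from the three crawls shown above.
SAMPLE = """\
Type of crawl: INDEX
No. of entries healed: 2
No. of entries in split-brain: 0
No. of heal failed entries: 102

Type of crawl: INDEX
No. of entries healed: 4
No. of entries in split-brain: 0
No. of heal failed entries: 91

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 88
"""

FAILED_RE = re.compile(r"^[ \t]*No\. of heal failed entries:[ \t]*(\d+)[ \t]*$",
                       re.MULTILINE)


def failed_counts(text: str) -> List[int]:
    """Return the 'heal failed' counter of each crawl, in order of appearance."""
    return [int(n) for n in FAILED_RE.findall(text)]


def is_non_increasing(counts: List[int]) -> bool:
    """True when no crawl reports more failures than the one before it."""
    return all(a >= b for a, b in zip(counts, counts[1:]))


if __name__ == "__main__":
    counts = failed_counts(SAMPLE)
    print(counts, "non-increasing:", is_non_increasing(counts))
```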

            Eventually, everything heals and goes back to at least
            where the roof isn't on fire anymore.

            The server stats and volume options were given in one of
            the previous replies to this thread.

            Any ideas or things I could run and show the output of
            to help diagnose? I'm also very open to working with
            someone on the team on a live debugging session if
            there's interest.


            It is likely that self-heal is causing the CPU spike due
            to the flood of lookups/locks and checksum fops that the
            self-heal-daemon sends to the bricks.
            There's a script to control shd's cpu usage using
            cgroups. That should help in regulating self-heal
            traffic: https://review.gluster.org/#/c/18404/
            <https://review.gluster.org/#/c/18404/> (see
            extras/control-cpu-load.sh)
            Other self-heal related volume options that you could
            change are setting 'cluster.data-self-heal-algorithm' to
            'full' and 'granular-entry-heal' to 'enable'.  `gluster
            volume set help` should give you more information about
            these options.
            Thanks,
            Ravi
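
For the first of the two suggestions above, the command line can be assembled and checked before it is run. This Python sketch only builds argument lists: the function names and the placeholder volume name are illustrative, the allowed values are the two named in the option's description ("full" and "diff"), and `gluster volume set <volname> <option> <value>` is the standard form of the set command.

```python
from typing import List

ALLOWED_ALGORITHMS = ("full", "diff")  # values named in the option description


def set_heal_algorithm_command(volname: str, algorithm: str) -> List[str]:
    """Return the argv that sets cluster.data-self-heal-algorithm on a volume."""
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"expected one of {ALLOWED_ALGORITHMS}, got {algorithm!r}")
    return ["gluster", "volume", "set", volname,
            "cluster.data-self-heal-algorithm", algorithm]


def option_help_command() -> List[str]:
    """Return the argv of the help command mentioned in the reply above."""
    return ["gluster", "volume", "set", "help"]


if __name__ == "__main__":
    # "myvol" is a placeholder volume name, not one from this thread.
    print(" ".join(set_heal_algorithm_command("myvol", "full")))
    print(" ".join(option_help_command()))
```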


            Thank you.


            Sincerely,
            Artem


            On Tue, Apr 10, 2018 at 9:56 AM, Artem Russakovskii
            <[email protected] <mailto:[email protected]>> wrote:

                Hi Vlad,

                I actually saw that post already and even asked a
                question 4 days ago
                (https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode#comment1172497_540917).
                The accepted answer also seems to go against your
                suggestion to enable direct-io-mode as it says it
                should be disabled for better performance when used
                just for file accesses.

                It'd be great if someone from the Gluster team
                chimed in about this thread.


                Sincerely,
                Artem


                On Tue, Apr 10, 2018 at 7:01 AM, Vlad Kopylov
                <[email protected] <mailto:[email protected]>> wrote:

                    I wish I knew, or could get, a detailed
                    description of those options myself.
                    Here is a discussion of direct-io-mode:
                    https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode
                    Like you, I ran tests on a large volume of
                    files and found that the main delays are in
                    attribute calls, so I ended up with those mount
                    options to improve performance.
                    I discovered those options basically by
                    googling this user list for people sharing
                    their tests.
                    I'm not sure I share your optimism: rather
                    than upgrading, I downgraded to 3.12 and have no
                    dir view issue now, though I had to recreate the
                    cluster and re-add bricks with existing data.

                    On Tue, Apr 10, 2018 at 1:47 AM, Artem
                    Russakovskii <[email protected]
                    <mailto:[email protected]>> wrote:

                        Hi Vlad,

                        I'm using only localhost: mounts.

                        Can you please explain what effect each
                        option has on performance issues shown in my
                        posts?
                        "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"
                        From what I remember, direct-io-mode=enable
                        didn't make a difference in my tests, but I
                        suppose I can try again. The explanations
                        about direct-io-mode are quite confusing on
                        the web in various guides, saying enabling
                        it could make performance worse in some
                        situations and better in others due to OS
                        file cache.
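
When comparing mount option strings like the one quoted above across test runs, it can help to split them into named settings first. The Python sketch below does only that: the function name is illustrative, flags without "=" are stored with a value of None, and the sketch says nothing about what each option does or whether a given glusterfs version accepts it.

```python
from typing import Dict, Optional


def parse_mount_options(option_string: str) -> Dict[str, Optional[str]]:
    """Split a comma-separated mount option string into a dict.

    'key=value' items keep their value as a string; bare flags such as
    'fopen-keep-cache' map to None.
    """
    parsed: Dict[str, Optional[str]] = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else None
    return parsed


if __name__ == "__main__":
    quoted = ("negative-timeout=10,attribute-timeout=30,fopen-keep-cache,"
              "direct-io-mode=enable,fetch-attempts=5")
    for name, value in parse_mount_options(quoted).items():
        print(name, "=", value)
```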

                        There are also these gluster volume
                        settings, adding to the confusion:
                        Option: performance.strict-o-direct
                        Default Value: off
                        Description: This option when set to off,
                        ignores the O_DIRECT flag.

                        Option: performance.nfs.strict-o-direct
                        Default Value: off
                        Description: This option when set to off,
                        ignores the O_DIRECT flag.

                        Re: 4.0. I moved to 4.0 after finding out
                        that it fixes the disappearing dirs bug
                        related to cluster.readdir-optimize if you
                        remember
                        (http://lists.gluster.org/pipermail/gluster-users/2018-April/033830.html).
                        I was already on 3.13 by then, and 4.0
                        resolved the issue. It's been stable for me
                        so far, thankfully.


                        Sincerely,
                        Artem


                        On Mon, Apr 9, 2018 at 10:38 PM, Vlad
                        Kopylov <[email protected]
                        <mailto:[email protected]>> wrote:

                            You definitely need to add mount options
                            to /etc/fstab.
                            Use the ones from here:
                            http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html

                            I went on with using local mounts to
                            achieve performance as well

                            Also, 3.12 or 3.10 branches would be
                            preferable for production

                            On Fri, Apr 6, 2018 at 4:12 AM, Artem
                            Russakovskii <[email protected]
                            <mailto:[email protected]>> wrote:

                                Hi again,

                                I'd like to expand on the
                                performance issues and plead for
                                help. Here's one case which shows
                                these odd hiccups:
                                https://i.imgur.com/CXBPjTK.gifv
                                <https://i.imgur.com/CXBPjTK.gifv>.

                                In this GIF where I switch back and
                                forth between copy operations on 2
                                servers, I'm copying a 10GB dir full
                                of .apk and image files.

                                On server "hive" I'm copying
                                straight from the main disk to an
                                attached volume block (xfs). As you
                                can see, the transfers are
                                relatively speedy and don't hiccup.
                                On server "citadel" I'm copying the
                                same set of data to a 4-replicate
                                gluster which uses block storage as
                                a brick. As you can see, performance
                                is much worse, and there are
                                frequent pauses for many seconds
                                where nothing seems to be happening
                                - just freezes.

                                All 4 servers have the same specs,
                                and all of them have performance
                                issues with gluster and no such
                                issues when raw xfs block storage is
                                used.

                                hive has long finished copying the
                                data, while citadel is barely
                                chugging along and is expected to
                                take probably half an hour to an
                                hour. I have over 1TB of data to
                                migrate, at which point if we went
                                live, I'm not even sure gluster
                                would be able to keep up instead of
                                bringing the machines and services down.



                                Here's the cluster config, though it
                                didn't seem to make any difference
                                performance-wise before I applied
                                the customizations vs after.

                                Volume Name: apkmirror_data1
                                Type: Replicate
                                Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e
                                Status: Started
                                Snapshot Count: 0
                                Number of Bricks: 1 x 4 = 4
                                Transport-type: tcp
                                Bricks:
                                Brick1: nexus2:/mnt/nexus2_block1/apkmirror_data1
                                Brick2: forge:/mnt/forge_block1/apkmirror_data1
                                Brick3: hive:/mnt/hive_block1/apkmirror_data1
                                Brick4: citadel:/mnt/citadel_block1/apkmirror_data1
                                Options Reconfigured:
                                cluster.quorum-count: 1
                                cluster.quorum-type: fixed
                                network.ping-timeout: 5
                                network.remote-dio: enable
                                performance.rda-cache-limit: 256MB
                                performance.readdir-ahead: on
                                performance.parallel-readdir: on
                                network.inode-lru-limit: 500000
                                performance.md-cache-timeout: 600
                                performance.cache-invalidation: on
                                performance.stat-prefetch: on
                                features.cache-invalidation-timeout: 600
                                features.cache-invalidation: on
                                cluster.readdir-optimize: on
                                performance.io-thread-count: 32
                                server.event-threads: 4
                                client.event-threads: 4
                                performance.read-ahead: off
                                cluster.lookup-optimize: on
                                performance.cache-size: 1GB
                                cluster.self-heal-daemon: enable
                                transport.address-family: inet
                                nfs.disable: on
                                performance.client-io-threads: on
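
To reproduce a set of reconfigured options like the list above on a test volume, the "name: value" lines can be turned into `gluster volume set` command strings for review. This is a minimal Python sketch: the function names are illustrative, the sample is a four-line excerpt of the list above, and it assumes each listed option can be set through `gluster volume set`, which has not been verified for every entry.

```python
from typing import Dict, List

# Excerpt of the "Options Reconfigured" list above.
SAMPLE = """\
cluster.quorum-count: 1
cluster.quorum-type: fixed
network.ping-timeout: 5
performance.cache-size: 1GB
"""


def parse_options(text: str) -> Dict[str, str]:
    """Turn 'name: value' lines into a dict, skipping anything else."""
    options: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        options[key.strip()] = value.strip()
    return options


def as_set_commands(volname: str, options: Dict[str, str]) -> List[str]:
    """Render one 'gluster volume set' command string per option."""
    return [f"gluster volume set {volname} {name} {value}"
            for name, value in options.items()]


if __name__ == "__main__":
    for command in as_set_commands("apkmirror_data1", parse_options(SAMPLE)):
        print(command)
```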


                                The mounts are done as follows in
                                /etc/fstab:
                                /dev/disk/by-id/scsi-0Linode_Volume_citadel_block1 /mnt/citadel_block1 xfs defaults 0 2
                                localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev 0 0
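
Each /etc/fstab entry has six whitespace-separated fields: source, mount point, filesystem type, options, dump flag and fsck pass number. The Python sketch below assembles such a line so that option changes can be tried consistently; the function name is illustrative and the second example simply substitutes the option string suggested elsewhere in this thread, without any claim that it improves performance.

```python
def fstab_line(source: str, mountpoint: str, fstype: str,
               options: str, dump: int = 0, passno: int = 0) -> str:
    """Join the six fstab fields with single spaces."""
    for field in (source, mountpoint, fstype, options):
        if not field or any(ch.isspace() for ch in field):
            # fstab fields are whitespace-delimited, so reject embedded spaces.
            raise ValueError(f"invalid fstab field: {field!r}")
    return f"{source} {mountpoint} {fstype} {options} {dump} {passno}"


if __name__ == "__main__":
    # The gluster mount exactly as described above.
    print(fstab_line("localhost:/apkmirror_data1", "/mnt/apkmirror_data1",
                     "glusterfs", "defaults,_netdev"))
    # Same mount with the option string suggested elsewhere in this thread appended.
    print(fstab_line("localhost:/apkmirror_data1", "/mnt/apkmirror_data1",
                     "glusterfs",
                     "defaults,_netdev,negative-timeout=10,attribute-timeout=30,"
                     "fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"))
```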

                                I'm really not sure if
                                direct-io-mode mount tweaks would do
                                anything here, what the value should
                                be set to, and what it is by default.

                                The OS is OpenSUSE 42.3, 64-bit.
                                80GB of RAM, 20 CPUs, hosted by Linode.

                                I'd really appreciate any help in
                                the matter.

                                Thank you.


                                Sincerely,
                                Artem


                                On Thu, Apr 5, 2018 at 11:13 PM,
                                Artem Russakovskii
                                <[email protected]
                                <mailto:[email protected]>> wrote:

                                    Hi,

                                    I'm trying to squeeze
                                    performance out of gluster on 4
                                    80GB RAM 20-CPU machines where
                                    Gluster runs on attached block
                                    storage (Linode) as 4 replicate
                                    bricks, and so far everything I
                                    have tried results in sub-optimal
                                    performance.

                                    There are many files - mostly
                                    images, several million - and
                                    many operations take minutes,
                                    copying multiple files (even if
                                    they're small) suddenly freezes
                                    up for seconds at a time, then
                                    continues, iostat frequently
                                    shows large r_await and w_awaits
                                    with 100% utilization for the
                                    attached block device, etc.

                                    But anyway, there are many
                                    guides out there for small-file
                                    performance improvements, but
                                    more explanation is needed, and
                                    I think more tweaks should be
                                    possible.

                                    My question today is
                                    about performance.cache-size. Is
                                    this a size of cache in RAM? If
                                    so, how do I view the current
                                    cache size to see if it gets
                                    full and I should increase its
                                    size? Is it advisable to bump it
                                    up if I have many tens of gigs
                                    of RAM free?



                                    More generally, in the last 2
                                    months since I first started
                                    working with gluster and set a
                                    production system live, I've
                                    been feeling frustrated because
                                    Gluster has a lot of
                                    poorly-documented and confusing
                                    options. I really wish
                                    documentation could be improved
                                    with examples and better
                                    explanations.

                                    Specifically, it'd be absolutely
                                    amazing if the docs offered a
                                    strategy for setting each value
                                    and ways of determining more
                                    optimal values. For example,
                                    for performance.cache-size, if
                                    it said something like "run
                                    command abc to see your current
                                    cache size, and if it's hurting,
                                    up it, but be aware that it's
                                    limited by RAM," it'd be already
                                    a huge improvement to the docs.
                                    And so on with other options.



                                    The gluster team is quite
                                    helpful on this mailing list,
                                    but in a reactive rather than
                                    proactive way. Perhaps it's
                                    tunnel vision once you've worked
                                    on a project for so long where
                                    less technical explanations and
                                    even proper documentation of
                                    options takes a back seat, but I
                                    encourage you to be more
                                    proactive about helping us
                                    understand and optimize Gluster.

                                    Thank you.

                                    Sincerely,
                                    Artem




                                _______________________________________________
                                Gluster-users mailing list
                                [email protected]
                                http://lists.gluster.org/mailman/listinfo/gluster-users







