Hi Joe,
I've read your blog extensively and frequently reference it to
correlate my own findings against. It has been one of the better
sources of information over the years. Sorry for the really, really
long email below, but I reckon it's required at this stage to explain
what's going on from what we can see.
Some of the replies I've received are of the form "use VMs for serving
content and use glusterfs for the backing store only". The problem with
this is that running 1000+ VMs for websites that in some cases don't
serve more than 10 users a day is an extreme waste of resources,
particularly RAM. Docker may limit the impact, but that's more complex
to achieve.
varnish and squid only really help if the content is set to be cached;
otherwise all requests hit the backend servers anyway. That said, yes,
we should deploy varnish/squid as a reverse proxy at some point, so
perhaps this should be step one. So effectively haproxy =>
varnish/squid => haproxy => apache/php (the second haproxy can probably
be eliminated since varnish/squid should know how to load balance
between multiple back-end servers, and SSL can then be offloaded away
from apache too).
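For illustration, a minimal varnish VCL sketch of that middle tier
(backend names and addresses are made up), showing why the second
haproxy becomes redundant - varnish can round-robin across the
apache/php nodes itself:

vcl 4.1;
import directors;

# hypothetical apache/php nodes behind varnish
backend web1 { .host = "192.0.2.11"; .port = "8080"; }
backend web2 { .host = "192.0.2.12"; .port = "8080"; }

sub vcl_init {
    # varnish balances across the backends, no second haproxy hop needed
    new pool = directors.round_robin();
    pool.add_backend(web1);
    pool.add_backend(web2);
}

sub vcl_recv {
    set req.backend_hint = pool.backend();
}

SSL would then terminate on the front haproxy, leaving apache on plain
HTTP.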
None of this solves the underlying problem though: with nl-cache,
performance is good (enough), but the filesystem is inconsistent;
without nl-cache, performance is terrible to the point where we are
considering shelving redundancy. Merely migrating to VMs doesn't
actually solve the redundancy problem either, as the VM then remains
the single point of failure.
Another consideration could be to rather use docker instances, such
that there is exactly one docker instance per virtual host, but I'm not
sure this solves the performance issue in that each docker instance
will still need to access the filesystem. So unless I can export a
*block* device via gfapi and mount that inside the container, I don't
see it helping (KVM can do this, but that's too RAM intensive since it
requires a VM per virtual host; at a minimum of 1GB RAM each that adds
up to at least 1TB of RAM per physical node, and I'm fairly certain CPU
usage will be significantly increased too).
One other solution currently being contemplated is to use lsync with a
cold standby host rather than a load-balanced setup. Switch-over will
have to be manual, and the risk w.r.t. data consistency (how up to date
the standby is) is also not something I really want to contemplate.
This would allow us to leave most of the rest of the configuration
intact. Here, however, lies the problem, as per the github page:
"synchronize a local directory tree with low profile of expected changes
to a remote mirror." ... this is definitely NOT low profile.
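For reference, the kind of one-shot pass lsync(d) would effectively be
driving looks something like this (/home is our actual mount point, the
standby hostname is a placeholder):

# mirror the tree to the cold standby ("standby" is hypothetical),
# preserving hardlinks, ACLs and xattrs; deletions propagate too
rsync -aHAX --delete --numeric-ids /home/ standby:/home/

The worry is exactly the window between such passes: anything written
on the active node after the last pass is lost on fail-over.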
First prize: Sort out the filesystem inconsistency while using
nl-cache, or at least dramatically reduce the time period of the
inconsistency from infinite to something relatively short (e.g. 30 to
60 seconds).
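The knobs I'd expect to bound that window look roughly like this
(values illustrative rather than tested recommendations; gv_home as per
the mount further down):

# negative-lookup cache itself, with its expiry
gluster volume set gv_home performance.nl-cache on
gluster volume set gv_home performance.nl-cache-timeout 60
# upcall-based invalidation so clients get told about changes
gluster volume set gv_home features.cache-invalidation on
gluster volume set gv_home features.cache-invalidation-timeout 600
gluster volume set gv_home performance.cache-invalidation on

In theory nl-cache-timeout alone should cap a stale negative entry at
60 seconds; in practice, as described above, the entries never seem to
expire.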
Second prize: Get close to nl-cache performance without nl-cache. This
doesn't seem feasible whilst still using php.
Third prize: sort out php to not generate as many negative filesystem
hits. realpath_cache_size doesn't seem to make a sufficient difference;
the default incidentally is no longer 16KB but 64KB (combine that with
raising realpath_cache_ttl from its default of 120 to, say, 86400), so
I'm guessing I can push this to 512KB or even 1MB, i.e. spend 1-2GB of
RAM on this. We may also need to switch the php-fpm process manager to
keep per-vhost processes around for longer, but that isn't a major
concern; we've got a reasonable amount of RAM available. Unless this
realpath_cache is persistent over multiple php-fpm processes.
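A sketch of what I have in mind (sizes are guesses based on the numbers
above; as far as I know the realpath cache lives per worker process and
is not shared, so workers have to stay alive for it to be worth
anything):

; php.ini - enlarge the per-process realpath cache, keep entries all day
realpath_cache_size = 1M
realpath_cache_ttl = 86400

; per-vhost php-fpm pool - keep idle workers around so their realpath
; cache survives between requests (values are illustrative)
pm = ondemand
pm.max_children = 10
pm.process_idle_timeout = 300s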
https://pecl.php.net/apcu just came onto my radar now; I can definitely
also investigate that. APC itself is dead from the looks of it. Looking
at the docs, though, the mechanism to avoid that stat() call is no
longer present either. And the primary goal of avoiding the stat() call
was to avoid triggering self-heal (which is nowadays off on the
glusterfs side by default anyway), so I'm not sure this will make a
significant difference.
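For the record, the knob in question used to be apc.stat=0 in the old
APC; if I'm reading things right (treat this as an assumption), the
equivalent nowadays sits with OPcache rather than APCu:

; skip the per-request stat() of already-compiled scripts entirely
; (requires an explicit opcache reset on deploys)
opcache.enable = 1
opcache.validate_timestamps = 0
; or keep validation but rate-limit it:
; opcache.revalidate_freq = 60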
Otherwise, that specific blog entry has been read through so many times
that I can mostly recall the recommendations from memory. You still
reference glusterfs 3.2.6 ... we're at 10.2, and we're running with an
extra inode-table-size patch by yours truly which helps avoid lock
contention when you have >64k files in the active set. There are other
tricks and hacks too, such as limiting the invalidate-limit to 16 or 32
(recommendations currently seem to be in the 128-256 region, but we
found that anything over 32 is simply untenable if lru-limit >>
inode-table-size; at 16 we pretty much avoid all latency spikes, with
the caveat that it's quite possible for the number of entries in the
inode table to exceed lru-limit for reasonable periods of time, but we
reason that's just an indicator that you should probably be increasing
lru-limit, and quite possibly inode-table-size too - patches on
github). The recommendation regarding RDMA over Infiniband is also no
longer applicable, since infiniband support in glusterfs has been
abandoned.
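Concretely, on the fuse mount that boils down to the flags quoted
further down, with the rule of thumb being lru-limit >=
inode-table-size and invalidate-limit <= 32:

/usr/sbin/glusterfs --lru-limit=524288 --inode-table-size=524288 \
    --invalidate-limit=16 --volfile-server=127.0.0.1 \
    --volfile-id=gv_home /home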
One other option that has not been mentioned is to use clustered LVM
and basically export PVs from glusterfs, which can then be carved into
cluster-aware VGs such that they're only active on one node at a time,
and then run some posix filesystem directly on those, basically
retaining the current setup otherwise. The caveat is that each vhost
will be active on only one specific node, which means we will need a
mechanism to ensure that all requests for a vhost always hit the right
physical node.
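That routing piece is simple enough in haproxy; a hedged sketch with
made-up vhost names and addresses:

frontend http_in
    bind :80
    # pin each vhost to the node where its filesystem is currently active
    acl is_sitea hdr(host) -i sitea.example.com
    use_backend node1 if is_sitea
    default_backend node2

backend node1
    server web1 192.0.2.11:80 check

backend node2
    server web2 192.0.2.12:80 check

The painful part is keeping those ACLs in sync with wherever the VG is
actually active.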
Kind Regards,
Jaco
On 2022/12/14 17:37, Joe Julian wrote:
PHP is not a good filesystem user. I've written about this a while
back:
https://joejulian.name/post/optimizing-web-performance-with-glusterfs/
On December 14, 2022 6:16:54 AM PST, Jaco Kroon <j...@uls.co.za> wrote:
Hi Peter,
Yes, we could, but with ~1000 vhosts that gets extremely
cumbersome to maintain and for clients to be able to manage their
own stuff. Essentially, unless the htdocs/ folder is on a single
filesystem, we're going to need to get involved with each and
every update, which isn't feasible. Then I'd rather partition the
vhosts such that half runs on one server and the other half on the
other server, and risk downtime.
Our experience indicates that the slow part is in fact not the
execution of the php code but php locating the files. It tries a
bunch of folders with stat() and/or open() and gets the ordering
wrong, resulting in numerous ENOENT errors before hitting the
right locations, after which it actually does quite well. On code
I wrote, which does NOT suffer this problem quite as badly as
wordpress, we find that from a local filesystem we get 200ms for
full processing (idle system, nvme physical disk, although I doubt
this matters since the fs layer should have most of this cached in
RAM anyway) vs 300ms on top of glusterfs. The bricks barely ever
go to disk (fs layer caching) according to the system stats we
gathered.
How do big hosting entities like wordpress.org (iirc) deal with
this? Because honestly, I doubt they do single-server setups.
Then again, I reckon that if you ONLY host wordpress (based on
experience) it's possible to have a single master copy of
wordpress on each server, with an lsync'ed themes/ folder for each
vhost and a shared (glusterfs) uploads folder. Enter things like
wordfence, which insist on being able to write to alternative
locations.
Anyway, barring using glusterfs we can certainly come up with
solutions, which may even include having *some* sites run on the
shared setup and others on single-host, possibly with something
like lsync keeping a "semi hot standby" up to date. That does get
complex though.
Our ideal solution remains a fairly performant clustered
filesystem such as glusterfs (with which we have a lot of
experience, including using it for large email clusters where its
performance is excellent, though I would have LOVED inotify
support). With nl-cache the performance is adequate; however, the
cache invalidation doesn't seem to function properly, which I
believe can be solved, either by fixing settings or by fixing code
bugs. Basically, whenever a file is modified or a new file is
created, clients should be alerted in order to invalidate the
cache. Since this cluster is mostly-read, some-write, and there
are only two clients, this should be perfectly manageable, and
there seem to be hints of this in the gluster volume options
already:
# gluster volume get volname all | grep invalid
performance.quick-read-cache-invalidation false (DEFAULT)
performance.ctime-invalidation false (DEFAULT)
performance.cache-invalidation on
performance.global-cache-invalidation true (DEFAULT)
features.cache-invalidation on
features.cache-invalidation-timeout 600
Kind Regards,
Jaco
On 2022/12/14 14:56, Péter Károly JUHÁSZ wrote:
We did this with WordPress too. It uses a ton of static files;
executing them is the slow part. You can rsync them and use the
upload dir from glusterfs.
Jaco Kroon <j...@uls.co.za> wrote on Wed, 14 December 2022 at 13:20:
Hi,
The problem is files generated by wordpress, and uploads etc
... so copying them to the frontend hosts, whilst making perfect
sense, assumes I have control over the code so that it doesn't
write to the local front-end; otherwise we could have relied on
something like lsync.
As it stands, performance is acceptable with nl-cache enabled,
but the fact that we get those ENOENT errors is highly
problematic.
Kind Regards,
Jaco Kroon
On 2022/12/14 14:04, Péter Károly JUHÁSZ wrote:
When we used glusterfs for websites, we copied the web dir
from gluster to local on frontend boots, then served it from
there.
Jaco Kroon <j...@uls.co.za> wrote on Wed, 14 December 2022 at 12:49:
Hi All,
We've got a glusterfs cluster that houses some php web sites.
This is generally considered a bad idea and we can see why.
With performance.nl-cache on it actually turns out to be very
reasonable; however, with this turned off, performance is
roughly 5x worse, meaning a request that would take sub 500ms
now takes 2500ms. In other cases we see far, far worse, e.g.
with nl-cache it takes ~1500ms, without it takes ~30s (20x
worse).
So why not use nl-cache? Well, it results in readdir reporting
files which then fail to open with ENOENT. The cache also never
clears, even though the configuration says nl-cache entries
should only be cached for 60s. Even for "ls -lah" in affected
folders you'll notice ???? entries for the attributes on files.
If this recovered in a reasonable time (say, a few seconds),
fine.
# gluster volume info
Type: Replicate
Volume ID: cbe08331-8b83-41ac-b56d-88ef30c0f5c7
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Options Reconfigured:
performance.nl-cache: on
cluster.readdir-optimize: on
config.client-threads: 2
config.brick-threads: 4
config.global-threading: on
performance.iot-pass-through: on
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
client.event-threads: 2
server.event-threads: 2
transport.address-family: inet
nfs.disable: on
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.self-heal-daemon: on
server.allow-insecure: on
features.ctime: off
performance.io-cache: on
performance.cache-invalidation: on
features.cache-invalidation: on
performance.qr-cache-timeout: 600
features.cache-invalidation-timeout: 600
performance.io-cache-size: 128MB
performance.cache-size: 128MB
Are there any other recommendations, short of abandoning all
hope of redundancy and reverting to a single-server setup (for
the web code at least)? Currently the cost of the redundancy
seems to outweigh the benefit.
Glusterfs version 10.2, with a patch for --inode-table-size.
Mounts happen with:
/usr/sbin/glusterfs --acl --reader-thread-count=2 \
    --lru-limit=524288 --inode-table-size=524288 \
    --invalidate-limit=16 --background-qlen=32 \
    --fuse-mountopts=nodev,nosuid,noexec,noatime \
    --process-name fuse --volfile-server=127.0.0.1 \
    --volfile-id=gv_home \
    --fuse-mountopts=nodev,nosuid,noexec,noatime /home
Kind Regards,
Jaco
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
________
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users