See inline:

> On Apr 20, 2019, at 10:09 AM, Patrick Rennie <[email protected]> wrote:
> 
> Hi Darrell, 
> 
> Thanks for your reply. This issue seems to be getting worse over the last few 
> days and really has me tearing my hair out. I will do as you have suggested 
> and get started on upgrading from 3.12.14 to 3.12.15. 
> I've checked the ZFS properties and all bricks have "xattr=sa" set, but none 
> of them has "acltype=posixacl" set; currently the acltype property shows 
> "off". If I make this change, will it apply retroactively to the existing 
> data? I'm unfamiliar with what this will change, so I may need to look into 
> that before I proceed. 

It is safe to apply that now; any new set/get calls will then use posix ACLs 
where they exist and fall back to the older behavior where they don't. ZFS is 
good that way. It should clear up your posix_acl and posix errors over time.
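
For example (the pool names here are placeholders; check "zfs list" for your 
actual datasets, and repeat for each brick pool):

# zfs set acltype=posixacl brick1
# zfs get acltype,xattr brick1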

> I understand performance is going to slow down as the bricks get full. I am 
> currently trying to free space and migrate data to some newer storage; I 
> have several hundred TB of fresh storage I set up recently, but with these 
> performance issues the migration is really slow. I also believe significant 
> data has been deleted directly from the bricks in the past, so if I can 
> reclaim this space in a safe manner then I will have at least around 10-15% 
> free space. 

Full ZFS volumes have a much larger impact on performance than you’d think, so 
I’d prioritize this. If you have been taking ZFS snapshots, consider deleting 
them to get the overall volume free space back up. And just to be sure it’s 
been said: delete from within the mounted gluster volume, don’t delete 
directly from the bricks (gluster will just try to heal it later, compounding 
your issues). This doesn’t apply to deleting other data from the ZFS volume if 
it’s not part of the brick directory, of course.
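
If you want to see what the snapshots are holding, something like this works 
(the snapshot name below is just an example):

# zfs list -t snapshot -o name,used -s used
# zfs destroy brick1@daily-2019-04-01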

> These servers have dual 8-core Xeons (E5-2620v4) and 512GB of RAM, so 
> generally they have plenty of resources available; we're currently only 
> using around 330/512GB of memory.
> 
> I will look into what your suggested settings will change, and then will 
> probably go ahead with your recommendations. For our specs as stated above, 
> what would you suggest for performance.io-thread-count?

I run single 2630v4s on my servers, which have a smaller storage footprint 
than yours. I’d go with 32 for performance.io-thread-count. I’d try 4 for the 
shd thread settings on that gear. Your memory use sounds fine, so no worries 
there.
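
On your volume, that would look something like this (cluster.shd-max-threads 
is the self-heal daemon thread setting for a replica volume like yours, and 
both can be set on the fly):

# gluster volume set gvAA01 performance.io-thread-count 32
# gluster volume set gvAA01 cluster.shd-max-threads 4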

> Our workload is nothing too extreme: we have a few VMs which write backup 
> data to this storage nightly for our clients. Our VMs don't live on this 
> cluster; they just write to it. 

If they are writing compressible data, you’ll get immediate benefit by setting 
compression=lz4 on your ZFS volumes. It won’t compress existing data, of 
course, but it will compress new data going forward. This is another one 
that’s safe to enable on the fly.
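
For example (again, substitute your real pool names; compressratio will show 
the effect as new data lands):

# zfs set compression=lz4 brick1
# zfs get compression,compressratio brick1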

> I've been going through all of the logs I can; below are some slightly 
> sanitized errors I've come across, but I'm not sure what to make of them. The 
> main error I am seeing is the first one below, across several of my bricks, 
> but possibly only for specific folders on the cluster; I'm not 100% sure 
> about that yet, though. 
> 
> [2019-04-20 05:56:59.512649] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 
> 0-gvAA01-posix: getxattr failed on /brick7/xxxxxxxxxxxxxxxxxxxx: 
> system.posix_acl_default  [Operation not supported]
> [2019-04-20 05:59:06.084333] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 
> 0-gvAA01-posix: getxattr failed on /brick7/xxxxxxxxxxxxxxxxxxxx: 
> system.posix_acl_default  [Operation not supported]
> [2019-04-20 05:59:43.289030] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 
> 0-gvAA01-posix: getxattr failed on /brick7/xxxxxxxxxxxxxxxxxxxx: 
> system.posix_acl_default  [Operation not supported]
> [2019-04-20 05:59:50.582257] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 
> 0-gvAA01-posix: getxattr failed on /brick7/xxxxxxxxxxxxxxxxxxxx: 
> system.posix_acl_default  [Operation not supported]
> [2019-04-20 06:01:42.501701] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 
> 0-gvAA01-posix: getxattr failed on /brick7/xxxxxxxxxxxxxxxxxxxx: 
> system.posix_acl_default  [Operation not supported]
> [2019-04-20 06:01:51.665354] W [posix.c:4929:posix_getxattr] 0-gvAA01-posix: 
> Extended attributes not supported (try remounting brick with 'user_xattr' 
> flag)
> 
> 
> [2019-04-20 13:12:36.131856] E [MSGID: 113002] 
> [posix-helpers.c:893:posix_gfid_set] 0-gvAA01-posix: gfid is null for 
> /xxxxxxxxxxxxxxxxxxxx [Invalid argument]
> [2019-04-20 13:12:36.131959] E [MSGID: 113002] [posix.c:362:posix_lookup] 
> 0-gvAA01-posix: buf->ia_gfid is null for 
> /brick2/xxxxxxxxxxxxxxxxxxxx_62906_tmp [No data available]
> [2019-04-20 13:12:36.132016] E [MSGID: 115050] 
> [server-rpc-fops.c:175:server_lookup_cbk] 0-gvAA01-server: 24274759: LOOKUP 
> /xxxxxxxxxxxxxxxxxxxx (a7c9b4a0-b7ee-4d01-a79e-576013c8ac87/Cloud 
> Backup_clone1.vbm_62906_tmp), client: 
> 00-A-16217-2019/04/08-21:23:03:692424-gvAA01-client-4-0-3, error-xlator: 
> gvAA01-posix [No data available]
> [2019-04-20 13:12:38.093719] E [MSGID: 115050] 
> [server-rpc-fops.c:175:server_lookup_cbk] 0-gvAA01-server: 24276491: LOOKUP 
> /xxxxxxxxxxxxxxxxxxxx (a7c9b4a0-b7ee-4d01-a79e-576013c8ac87/Cloud 
> Backup_clone1.vbm_62906_tmp), client: 
> 00-A-16217-2019/04/08-21:23:03:692424-gvAA01-client-4-0-3, error-xlator: 
> gvAA01-posix [No data available]
> [2019-04-20 13:12:38.093660] E [MSGID: 113002] 
> [posix-helpers.c:893:posix_gfid_set] 0-gvAA01-posix: gfid is null for 
> /xxxxxxxxxxxxxxxxxxxx [Invalid argument]
> [2019-04-20 13:12:38.093696] E [MSGID: 113002] [posix.c:362:posix_lookup] 
> 0-gvAA01-posix: buf->ia_gfid is null for /brick2/xxxxxxxxxxxxxxxxxxxx [No 
> data available]
> 

Setting acltype=posixacl should clear those up, as mentioned.
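
Once acltype is set, you can verify by re-running one of the failing getxattr 
calls by hand (use one of your real brick paths from the log): before the 
change it returns "Operation not supported", and after it you should get 
either the ACL or a plain "No such attribute" instead.

# getfattr -n system.posix_acl_default /brick7/<path-from-the-log>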

> 
> [2019-04-20 14:25:59.654576] E [inodelk.c:404:__inode_unlock_lock] 
> 0-gvAA01-locks:  Matching lock not found for unlock 0-9223372036854775807, by 
> 980fdbbd367f0000 on 0x7fc4f0161440
> [2019-04-20 14:25:59.654668] E [MSGID: 115053] 
> [server-rpc-fops.c:295:server_inodelk_cbk] 0-gvAA01-server: 6092928: INODELK 
> /xxxxxxxxxxxxxxxxxxxx.cdr$ (25b14631-a179-4274-8243-6e272d4f2ad8), client: 
> cb-per-worker18-53637-2019/04/19-14:25:37:927673-gvAA01-client-1-0-4, 
> error-xlator: gvAA01-locks [Invalid argument]
> 
> 
> [2019-04-20 13:35:07.495495] E [rpcsvc.c:1364:rpcsvc_submit_generic] 
> 0-rpc-service: failed to submit message (XID: 0x247c644, Program: GlusterFS 
> 3.3, ProgVers: 330, Proc: 27) to rpc-transport (tcp.gvAA01-server)
> [2019-04-20 13:35:07.495619] E [server.c:195:server_submit_reply] 
> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/debug/io-stats.so(+0x1696a)
>  [0x7ff4ae6f796a] 
> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/protocol/server.so(+0x2d6e8)
>  [0x7ff4ae2a96e8] 
> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/protocol/server.so(+0x928d)
>  [0x7ff4ae28528d] ) 0-: Reply submission failed
> 

Fix the posix ACLs and see if these clear up over time as well; I’m unclear on 
what the overall effect of running without posix ACLs is on total gluster 
health. Your biggest problem sounds like space: free up room on the volumes, 
get the overall volume health back up to par, and see if that doesn’t resolve 
the symptoms you’re seeing.
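
Something like this is a quick way to keep an eye on both as you migrate data 
off (volume name taken from your volume info below):

# zpool list -o name,size,allocated,free,capacity
# gluster volume heal gvAA01 info | grep -i 'number of entries'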


> 
> Thank you again for your assistance. It is greatly appreciated. 
> 
> - Patrick
> 
> 
> 
> On Sat, Apr 20, 2019 at 10:50 PM Darrell Budic <[email protected]> wrote:
> Patrick,
> 
> I would definitely upgrade your two nodes from 3.12.14 to 3.12.15. You also 
> mention ZFS, and that error you show makes me think you need to check to be 
> sure you have “xattr=sa” and “acltype=posixacl” set on your ZFS volumes.
> 
> You also observed your bricks are crossing the 95% full line; ZFS performance 
> will degrade significantly the closer you get to full. In my experience, this 
> starts somewhere between 10% and 5% free space remaining, so you’re in that 
> realm. 
> 
> How’s your free memory on the servers doing? Do you have your ZFS ARC cache 
> limited to something less than all the RAM? It shares pretty well, but I’ve 
> encountered situations where other things won’t try to take RAM back properly 
> if they think it’s in use, so ZFS never gets the opportunity to give it up.
> 
> Since your volume is a replica volume with arbiters, you might try tuning 
> cluster.shd-max-threads; the default is 1, and I’d try it at 2, 4, or even 
> more if the CPUs are beefy enough. And setting server.event-threads to 4 and 
> client.event-threads to 8 has proven helpful in many cases. After you get 
> upgraded to 3.12.15, enabling performance.stat-prefetch may help as well. I 
> don’t know if it matters, but I’d also recommend resetting 
> performance.least-prio-threads to the default of 1 (or try 2 or 4) and/or 
> also setting performance.io-thread-count to 32 if those have beefy CPUs.
> 
> Beyond those general ideas, more info about your hardware (CPU and RAM) and 
> workload (VMs, direct storage for web servers or enders, etc) may net you 
> some more ideas. Then you’re going to have to do more digging into brick logs 
> looking for errors and/or warnings to see what’s going on.
> 
>   -Darrell
> 
> 
>> On Apr 20, 2019, at 8:22 AM, Patrick Rennie <[email protected]> wrote:
>> 
>> Hello Gluster Users, 
>> 
>> I am hoping someone can help me with resolving an ongoing issue I've been 
>> having; I'm new to mailing lists, so forgive me if I have gotten anything 
>> wrong. We have noticed our performance deteriorating over the last few 
>> weeks, easily measured by timing an ls on one of our top-level folders: it 
>> usually takes 2-5 seconds and now takes up to 20 minutes, which obviously 
>> renders our cluster basically unusable. This has been intermittent in the 
>> past but is now almost constant, and I am not sure how to work out the 
>> exact cause. 
>> 
>> We have noticed some errors in the brick logs, and have noticed that if we 
>> kill the right brick process, performance instantly returns to normal. This 
>> is not always the same brick, but it indicates to me that something in the 
>> brick processes or background tasks may be causing extreme latency. Because 
>> killing the right brick process fixes it, I think a specific file, folder, 
>> or operation may be hanging and causing the increased latency, but I am not 
>> sure how to work out which. 
>> 
>> One last thing to add is that our bricks are getting quite full (~95%); we 
>> are trying to migrate data off to new storage, but that is going slowly, 
>> not helped by this issue. I am currently trying to run a full heal, as 
>> there appear to be many files needing healing, and I have all brick 
>> processes running so they have an opportunity to heal, but this means 
>> performance is very poor. It currently takes over 15-20 minutes to do an ls 
>> of one of our top-level folders, which just contains 60-80 other folders; 
>> this should take 2-5 seconds. This is all being checked via a FUSE mount 
>> locally on the storage node itself, but it is the same for other clients 
>> and VMs accessing the cluster. Initially, it seemed our NFS mounts were not 
>> affected and operated at normal speed, but testing over the last day has 
>> shown that our NFS clients are also extremely slow, so it doesn't seem 
>> specific to FUSE as I first thought it might be. 
>> 
>> I am not sure how to proceed from here; I am fairly new to gluster, having 
>> inherited this setup from my predecessor, and am trying to keep it going. I 
>> have included some info below to try and help with diagnosis; please let me 
>> know if any further info would be helpful. I would really appreciate any 
>> advice on what I could try to work out the cause. Thank you in advance for 
>> reading this, and for any suggestions you might be able to offer. 
>> 
>> - Patrick
>> 
>> This is an example of the main error I see in our brick logs; there have 
>> been others, and I can post them when I see them again too:
>> [2019-04-20 04:54:43.055680] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 
>> 0-gvAA01-posix: getxattr failed on /brick1/<filename> library: 
>> system.posix_acl_default  [Operation not supported]
>> [2019-04-20 05:01:29.476313] W [posix.c:4929:posix_getxattr] 0-gvAA01-posix: 
>> Extended attributes not supported (try remounting brick with 'user_xattr' 
>> flag)
>> 
>> Our setup consists of 2 storage nodes and an arbiter node. I have noticed 
>> our nodes are on slightly different versions; I'm not sure if this could be 
>> an issue. We have 9 bricks on each node, made up of ZFS RAIDZ2 pools; total 
>> capacity is around 560TB. 
>> We have bonded 10gbps NICs on each node, and I have tested bandwidth with 
>> iperf and found that it's what would be expected from this config. 
>> Individual brick performance seems OK; I've tested several bricks using dd 
>> and can write a 10GB file at 1.7GB/s. 
>> 
>> # dd if=/dev/zero of=/brick1/test/test.file bs=1M count=10000
>> 10000+0 records in
>> 10000+0 records out
>> 10485760000 bytes (10 GB, 9.8 GiB) copied, 6.20303 s, 1.7 GB/s
>> 
>> Node 1:
>> # glusterfs --version
>> glusterfs 3.12.15
>> 
>> Node 2:
>> # glusterfs --version
>> glusterfs 3.12.14
>> 
>> Arbiter:
>> # glusterfs --version
>> glusterfs 3.12.14
>> 
>> Here is our gluster volume status:
>> 
>> # gluster volume status
>> Status of volume: gvAA01
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick 01-B:/brick1/gvAA01/brick             49152     0          Y       7219
>> Brick 02-B:/brick1/gvAA01/brick             49152     0          Y       21845
>> Brick 00-A:/arbiterAA01/gvAA01/brick1       49152     0          Y       6931
>> Brick 01-B:/brick2/gvAA01/brick             49153     0          Y       7239
>> Brick 02-B:/brick2/gvAA01/brick             49153     0          Y       9916
>> Brick 00-A:/arbiterAA01/gvAA01/brick2       49153     0          Y       6939
>> Brick 01-B:/brick3/gvAA01/brick             49154     0          Y       7235
>> Brick 02-B:/brick3/gvAA01/brick             49154     0          Y       21858
>> Brick 00-A:/arbiterAA01/gvAA01/brick3       49154     0          Y       6947
>> Brick 01-B:/brick4/gvAA01/brick             49155     0          Y       31840
>> Brick 02-B:/brick4/gvAA01/brick             49155     0          Y       9933
>> Brick 00-A:/arbiterAA01/gvAA01/brick4       49155     0          Y       6956
>> Brick 01-B:/brick5/gvAA01/brick             49156     0          Y       7233
>> Brick 02-B:/brick5/gvAA01/brick             49156     0          Y       9942
>> Brick 00-A:/arbiterAA01/gvAA01/brick5       49156     0          Y       6964
>> Brick 01-B:/brick6/gvAA01/brick             49157     0          Y       7234
>> Brick 02-B:/brick6/gvAA01/brick             49157     0          Y       9952
>> Brick 00-A:/arbiterAA01/gvAA01/brick6       49157     0          Y       6974
>> Brick 01-B:/brick7/gvAA01/brick             49158     0          Y       7248
>> Brick 02-B:/brick7/gvAA01/brick             49158     0          Y       9960
>> Brick 00-A:/arbiterAA01/gvAA01/brick7       49158     0          Y       6984
>> Brick 01-B:/brick8/gvAA01/brick             49159     0          Y       7253
>> Brick 02-B:/brick8/gvAA01/brick             49159     0          Y       9970
>> Brick 00-A:/arbiterAA01/gvAA01/brick8       49159     0          Y       6993
>> Brick 01-B:/brick9/gvAA01/brick             49160     0          Y       7245
>> Brick 02-B:/brick9/gvAA01/brick             49160     0          Y       9984
>> Brick 00-A:/arbiterAA01/gvAA01/brick9       49160     0          Y       7001
>> NFS Server on localhost                     2049      0          Y       17276
>> Self-heal Daemon on localhost               N/A       N/A        Y       25245
>> NFS Server on 02-B                          2049      0          Y       9089
>> Self-heal Daemon on 02-B                    N/A       N/A        Y       17838
>> NFS Server on 00-a                          2049      0          Y       15660
>> Self-heal Daemon on 00-a                    N/A       N/A        Y       16218
>> 
>> Task Status of Volume gvAA01
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>> 
>> And gluster volume info: 
>> 
>> # gluster volume info
>> 
>> Volume Name: gvAA01
>> Type: Distributed-Replicate
>> Volume ID: ca4ece2c-13fe-414b-856c-2878196d6118
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 9 x (2 + 1) = 27
>> Transport-type: tcp
>> Bricks:
>> Brick1: 01-B:/brick1/gvAA01/brick
>> Brick2: 02-B:/brick1/gvAA01/brick
>> Brick3: 00-A:/arbiterAA01/gvAA01/brick1 (arbiter)
>> Brick4: 01-B:/brick2/gvAA01/brick
>> Brick5: 02-B:/brick2/gvAA01/brick
>> Brick6: 00-A:/arbiterAA01/gvAA01/brick2 (arbiter)
>> Brick7: 01-B:/brick3/gvAA01/brick
>> Brick8: 02-B:/brick3/gvAA01/brick
>> Brick9: 00-A:/arbiterAA01/gvAA01/brick3 (arbiter)
>> Brick10: 01-B:/brick4/gvAA01/brick
>> Brick11: 02-B:/brick4/gvAA01/brick
>> Brick12: 00-A:/arbiterAA01/gvAA01/brick4 (arbiter)
>> Brick13: 01-B:/brick5/gvAA01/brick
>> Brick14: 02-B:/brick5/gvAA01/brick
>> Brick15: 00-A:/arbiterAA01/gvAA01/brick5 (arbiter)
>> Brick16: 01-B:/brick6/gvAA01/brick
>> Brick17: 02-B:/brick6/gvAA01/brick
>> Brick18: 00-A:/arbiterAA01/gvAA01/brick6 (arbiter)
>> Brick19: 01-B:/brick7/gvAA01/brick
>> Brick20: 02-B:/brick7/gvAA01/brick
>> Brick21: 00-A:/arbiterAA01/gvAA01/brick7 (arbiter)
>> Brick22: 01-B:/brick8/gvAA01/brick
>> Brick23: 02-B:/brick8/gvAA01/brick
>> Brick24: 00-A:/arbiterAA01/gvAA01/brick8 (arbiter)
>> Brick25: 01-B:/brick9/gvAA01/brick
>> Brick26: 02-B:/brick9/gvAA01/brick
>> Brick27: 00-A:/arbiterAA01/gvAA01/brick9 (arbiter)
>> Options Reconfigured:
>> cluster.shd-max-threads: 4
>> performance.least-prio-threads: 16
>> cluster.readdir-optimize: on
>> performance.quick-read: off
>> performance.stat-prefetch: off
>> cluster.data-self-heal: on
>> cluster.lookup-unhashed: auto
>> cluster.lookup-optimize: on
>> cluster.favorite-child-policy: mtime
>> server.allow-insecure: on
>> transport.address-family: inet
>> client.bind-insecure: on
>> cluster.entry-self-heal: off
>> cluster.metadata-self-heal: off
>> performance.md-cache-timeout: 600
>> cluster.self-heal-daemon: enable
>> performance.readdir-ahead: on
>> diagnostics.brick-log-level: INFO
>> nfs.disable: off
>> 
>> Thank you for any assistance. 
>> 
>> - Patrick
>> _______________________________________________
>> Gluster-users mailing list
>> [email protected]
>> https://lists.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
