Thanks again. I have tried running a find over the cluster to try to trigger self-healing, but it's very slow, so I don't have it running right now. If I check the same "ls /brick/folder" on all bricks directly, it takes less than 0.01 sec, so I don't think any individual brick is causing the problem; performance on each brick seems to be normal. I think the issue is somewhere in the gluster internal communication, as I believe FUSE-mounted clients will try to communicate with all bricks. Unfortunately, I am not sure how to confirm this or narrow it down. I'm really struggling with this one now, and it's starting to significantly impact our operations. I'm not sure what else I can try, so I'd appreciate any suggestions; I've also added a note at the bottom of this mail, below the quoted thread, about one thing I'm planning to try next.
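In case it helps to narrow things down, this is roughly what I've been checking so far on the brick side, plus what I plan to run next to confirm which bricks each FUSE client actually holds connections to (the brick path below is just an example from my layout; the volume name is gvAA01 as before):

# on each brick host - every one of these returns in well under a second
time ls /brick/folder > /dev/null

# from any storage node - lists the clients connected to each brick
gluster volume status gvAA01 clients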
Thank you,
- Patrick

On Sun, Apr 21, 2019 at 11:50 PM Strahil <[email protected]> wrote:

> Usually when this happens I run 'find /fuse/mount/point -exec stat {} \;' from a client (using gluster with oVirt).
> Yet, my scale is multiple times smaller and I don't know how this will affect you (except it will trigger a heal).
>
> So the round-robin of the DNS clarifies the mystery. In such case, maybe the FUSE client is not the problem. Still, it is worth trying a VM with the new gluster version to mount the cluster.
>
> From the profile (took a short glance over it from my phone), not all bricks are spending much of their time in LOOKUP.
> Maybe your data is not evenly distributed? Is that even possible?
> Sadly you can't rebalance until all those heals are pending. (Maybe I'm wrong.)
>
> Have you checked the speed of 'ls /my/brick/subdir1/' on each brick?
>
> Sadly, I'm just a gluster user, so take everything with a grain of salt.
>
> Best Regards,
> Strahil Nikolov
>
> On Apr 21, 2019 18:03, Patrick Rennie <[email protected]> wrote:
>
> > I just tried to check my "gluster volume heal gvAA01 statistics" and it doesn't seem like a full heal was still in progress, just an index. I have started the full heal again and am trying to monitor it with "gluster volume heal gvAA01 info", which just shows me thousands of gfid file identifiers scrolling past.
> > What is the best way to check the status of a heal and track the files healed and progress to completion?
> >
> > Thank you,
> > - Patrick
> >
> > On Sun, Apr 21, 2019 at 10:28 PM Patrick Rennie <[email protected]> wrote:
> >
> > I think I just worked out why NFS lookups are sometimes slow and sometimes fast: the hostname uses round-robin DNS lookups. If I change to a specific host, 01-B, it's always quick, and if I change to the other brick host, 02-B, it's always slow.
> > Maybe that will help to narrow this down?
> >
> > On Sun, Apr 21, 2019 at 10:24 PM Patrick Rennie <[email protected]> wrote:
> >
> > Hi Strahil,
> >
> > Thank you for your reply and your suggestions. I'm not sure which logs would be most relevant to check to diagnose this issue: the brick logs, the cluster mount logs, the shd logs, or something else? I have posted a few that I have seen repeated a few times already, and I will continue to post anything further that I see.
> > I am working on migrating data to some new storage, so this will slowly free up space, although this is a production cluster and new data is being uploaded every day, sometimes faster than I can migrate it off. I have several other similar clusters and none of them have the same problem; one of the others is actually at 98-99% right now (big problem, I know) but still performs perfectly fine compared to this cluster, so I am not sure low space is the root cause here.
> >
> > I currently have 13 VMs accessing this cluster. I have checked each one, and all of them use one of the two options below to mount the cluster in fstab:
> >
> > HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable,use-readdirp=no 0 0
> > HOSTNAME:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable
> >
> > I also have a few other VMs which use NFS to access the cluster, and these machines appear to be significantly quicker. Initially I get a similar delay with NFS, but if I cancel the first "ls" and try it again I get < 1 sec lookups; via the FUSE/gluster client this can take over 10 minutes, and the same trick of cancelling and trying again doesn't work. Sometimes the NFS queries have no delay at all, so this is a bit strange to me.
> >
> > HOSTNAME:/gvAA01 /mountpoint/ nfs defaults,_netdev,vers=3,async,noatime 0 0
> >
> > Example:
> > user@VM:~$ time ls /cluster/folder
> > ^C
> >
> > real    9m49.383s
> > user    0m0.001s
> > sys     0m0.010s
> >
> > user@VM:~$ time ls /cluster/folder
> > <results>
> >
> > real    0m0.069s
> > user    0m0.001s
> > sys     0m0.007s
> >
> > ---
> >
> > I have checked the profiling as you suggested. I let it run for around a minute, then cancelled it and saved the profile info.
> >
> > root@HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 start
> > Starting volume profile on gvAA01 has been successful
> > root@HOSTNAME:/var/log/glusterfs# time ls /cluster/folder
> > ^C
> >
> > real    1m1.660s
> > user    0m0.000s
> > sys     0m0.002s
> >
> > root@HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 info >> ~/profile.txt
> > root@HOSTNAME:/var/log/glusterfs# gluster volume profile gvAA01 stop
> >
> > I will attach the results to this email as it's o
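P.S. One idea I'm planning to try next, given the round-robin DNS behaviour described above: pin the FUSE mounts to the faster node (01-B) and list the other node only as a fallback via the backup-volfile-servers mount option, along the lines below (the mount point is a placeholder, as in the fstab entries above). As I understand it, the volfile server mainly controls where the client fetches the volume file from, so this may only help rule DNS in or out rather than change the data path, but it seems worth testing:

01-B:/gvAA01 /mountpoint glusterfs defaults,_netdev,rw,log-level=WARNING,direct-io-mode=disable,backup-volfile-servers=02-B 0 0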
