On 08/10/2018 11:25 PM, Pablo Schandin wrote:
Hello everyone!
I'm having some trouble with something, but I'm not quite sure what yet. I'm
running GlusterFS 3.12.6 on Ubuntu 16.04. I have two servers (nodes) in the
cluster in replica mode, and each server has 2 bricks. Since the servers are
KVM hosts running several VMs, on each server one brick holds the VMs defined
locally and the other brick is the replica of the corresponding brick on the
other server: it holds data, but no actual writing is done to it except for
the replication.
                 Server 1                                    Server 2
Volume 1 (gv1):  Brick 1: defined VMs (read/write)   ---->   Brick 1: replicated qcow2 files
Volume 2 (gv2):  Brick 2: replicated qcow2 files     <-----  Brick 2: defined VMs (read/write)
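For reference, gluster volume info for a replica 2 volume like gv1 looks
roughly like this (server names and brick paths here are simplified
placeholders, not my actual paths):

  # gluster volume info gv1
  Volume Name: gv1
  Type: Replicate
  Status: Started
  Number of Bricks: 1 x 2 = 2
  Transport-type: tcp
  Bricks:
  Brick1: server1:/data/brick1/gv1
  Brick2: server2:/data/brick1/gv1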
So, the main issue arose when I got a Nagios alarm warning about a file
listed to be healed, which then disappeared. I came to find out that every 5
minutes the self-heal daemon triggers the healing and this fixes it. But
looking at the logs I see a lot of entries like this in the glustershd.log
file:
[2018-08-09 14:23:37.689403] I [MSGID: 108026]
[afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0:
Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c.
sources=[0] sinks=1
[2018-08-09 14:44:37.933143] I [MSGID: 108026]
[afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0:
Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f.
sources=[0] sinks=1
The qcow2 files are being healed several times a day (up to 30 times on
occasion). As I understand it, this means that a data heal occurred on the
files with gfids 407b... and 7371..., from source to sink. Local server to
replica server? Is it OK for the shd to heal files on the replicated brick
that supposedly has no writes on it besides the mirroring? How does that
work?
In AFR there is no notion of a local or remote brick for writes. No matter
which client you write to the volume from, the write is sent to both bricks,
i.e. the replication is synchronous and happens in real time.
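The sources=[0] sinks=1 in your log means child 0 (the first brick) was the
source and child 1 (the second brick) the sink for that heal. If you want to
see this yourself, you can look at the AFR changelog xattrs directly on a
brick; the brick path, file name and xattr values below are only an example
(the gfid is the one from your log):

  # getfattr -d -m . -e hex /data/brick1/gv1/images/owncloud-root.qcow2
  # file: data/brick1/gv1/images/owncloud-root.qcow2
  trusted.afr.dirty=0x000000000000000000000000
  trusted.afr.gv1-client-1=0x000000120000000000000000
  trusted.gfid=0x737135565b634f91b83dd7d82fee111f

A non-zero trusted.afr.gv1-client-1 value seen on the first brick means that
brick has pending changes recorded against the second brick, i.e. brick 1 is
the source and brick 2 the sink, which is what the glustershd.log messages
are reporting.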
How does AFR replication work? The file with gfid 7371... is the qcow2 root
disk of an ownCloud server with 17GB of data. It does not seem big enough to
be a bottleneck of some sort, I think.
Also, I was investigating the directory tree in brick/.glusterfs/indices and
I noticed that both in xattrop and in dirty there is always a file named
xattrop-xxxxxx or dirty-xxxxxx. I read that the xattrop file acts as a parent
file or handle, and the other files created there are hardlinks to it named
after the gfid, which the shd uses to know what to heal. Is it the same for
the ones in the dirty dir?
Yes, before the write, the gfid gets captured inside dirty on all
bricks. If the write is successful, it gets removed. In addition, if the
write fails on one brick, the other brick will capture the gfid inside
xattrop.
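If you want to see this on a brick (brick path below is assumed, the gfid is
the one from your glustershd.log), the index entries are just gfid-named
hardlinks sitting next to the xattrop-xxxxxx base file. Separately, the gfid
can be resolved back to the real file through its hardlink under
.glusterfs/<first two>/<next two>/<gfid>:

  # ls /data/brick1/gv1/.glusterfs/indices/xattrop
  xattrop-xxxxxx  407bd97b-e76c-4f81-8f59-7dae11507b0c
  # find /data/brick1/gv1 -samefile \
      /data/brick1/gv1/.glusterfs/40/7b/407bd97b-e76c-4f81-8f59-7dae11507b0c \
      -not -path '*/.glusterfs/*'

The index entries disappear once the self-heal daemon processes them, which
is why the Nagios alarm clears on its own.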
Any help will be greatly appreciated. Thanks!
If frequent heals are triggered, it could mean there are frequent network
disconnects from the clients to the bricks while writes happen. You can check
the mount logs to see if that is the case and investigate possible network
issues.
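For example (the mount log file name depends on where the volume is mounted,
so the path below is just a guess):

  # grep -i disconnect /var/log/glusterfs/mnt-gv1.log
  # gluster volume heal gv1 info
  # gluster volume heal gv1 statistics heal-count

The first shows client-to-brick disconnects around the times of the heals;
the other two show which entries are currently pending heal and how many
entries each brick still has to heal.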
HTH,
Ravi
Pablo.
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users