On 08/15/2018 11:07 PM, Pablo Schandin wrote:
I found another log that I wasn't aware of in
/var/log/glusterfs/brick; I had confused it with the mount log. In
this file I see a lot of entries like this one:
[2018-08-15 16:41:19.568477] I [addr.c:55:compare_addr_and_update]
0-/mnt/brick1/gv1: allowed = "172.20.36.10", received addr =
"172.20.36.11"
[2018-08-15 16:41:19.568527] I [addr.c:55:compare_addr_and_update]
0-/mnt/brick1/gv1: allowed = "172.20.36.11", received addr =
"172.20.36.11"
[2018-08-15 16:41:19.568547] I [login.c:76:gf_auth] 0-auth/login:
allowed user names: 7107ccfa-0ba1-4172-aa5a-031568927bf1
[2018-08-15 16:41:19.568564] I [MSGID: 115029]
[server-handshake.c:793:server_setvolume] 0-gv1-server: accepted
client from
physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0
(version: 3.12.6)
[2018-08-15 16:41:19.582710] I [MSGID: 115036]
[server.c:527:server_rpc_notify] 0-gv1-server: disconnecting
connection from
physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0
[2018-08-15 16:41:19.582830] I [MSGID: 101055]
[client_t.c:443:gf_client_unref] 0-gv1-server: Shutting down
connection
physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0
So I see a lot of disconnections, right? Could this be why the self
healing is triggered all the time?
Not necessarily. These disconnects could also be due to the glfsheal
binary, which is invoked when you run `gluster vol heal volname info`
etc.; those connections do not cause heals. It would be better to check
your client mount logs for disconnect messages like these:
[2018-08-16 03:59:32.289763] I [MSGID: 114018]
[client.c:2285:client_rpc_notify] 0-testvol-client-0: disconnected from
testvol-client-0. Client process will keep trying to connect to glusterd
until brick's port is available
If there are no disconnects and you are still seeing files undergoing
heal, then you might want to check the brick logs to see if there are
any write failures.
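Something like the following could narrow that down (a minimal sketch;
the log file names are assumptions based on the default locations, and
the mount log name derives from your mount path):

# Disconnect messages in the fuse client mount log
grep "disconnected from" /var/log/glusterfs/mnt-gv1.log
# Error-level (" E ") messages, including write failures, in the brick logs
grep ' E ' /var/log/glusterfs/bricks/*.log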
Thanks,
Ravi
Thanks!
Pablo.
On 08/14/2018 09:15 AM, Pablo Schandin wrote:
Thanks for the info!
I cannot see anything in the mount log besides one line every time it
rotates:
[2018-08-13 06:25:02.246187] I
[glusterfsd-mgmt.c:1821:mgmt_getspec_cbk] 0-glusterfs: No change in
volfile, continuing
But in the glfsheal-gv1.log of the volumes I did find some kind of
server-client connection that gets disconnected and then reconnects
using a different port. The block of log for each run is kind of long,
so I'm copying it into a pastebin:
https://pastebin.com/bp06rrsT
Maybe this has something to do with it?
Thanks!
Pablo.
On 08/11/2018 12:19 AM, Ravishankar N wrote:
On 08/10/2018 11:25 PM, Pablo Schandin wrote:
Hello everyone!
I'm having some trouble with something, but I'm not quite sure what
yet. I'm running GlusterFS 3.12.6 on Ubuntu 16.04. I have two servers
(nodes) in the cluster in replica mode, and each server has 2 bricks.
The servers are KVM hosts running several VMs, so on each server one
brick holds the locally defined VMs and the second brick is the replica
from the other server; the second brick has data, but no actual writing
is done to it except for the replication.
                Server 1                                   Server 2
Volume 1 (gv1): Brick 1: defined VMs (read/write)  ---->   Brick 1: replicated qcow2 files
Volume 2 (gv2): Brick 2: replicated qcow2 files    <----   Brick 2: defined VMs (read/write)
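For reference, a layout like this would come from replica-2 volume
create commands along these lines (a sketch; the hostnames are
placeholders, and the brick paths are guessed from the brick log above):

# Two replica-2 volumes, one brick on each server
gluster volume create gv1 replica 2 server1:/mnt/brick1/gv1 server2:/mnt/brick1/gv1
gluster volume create gv2 replica 2 server1:/mnt/brick2/gv2 server2:/mnt/brick2/gv2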
So, the main issue arose when I got a Nagios alarm warning about a
file listed to be healed, and then it disappeared. I came to find out
that every 5 minutes the self-heal daemon triggers the healing and
fixes it. But looking at the logs, I have a lot of entries like this
in the glustershd.log file:
[2018-08-09 14:23:37.689403] I [MSGID: 108026]
[afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0:
Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c.
sources=[0] sinks=1
[2018-08-09 14:44:37.933143] I [MSGID: 108026]
[afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0:
Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f.
sources=[0] sinks=1
The qcow2 files are being healed several times a day (up to 30 times
on occasion). As I understand it, this means that a data heal occurred
on the files with gfids 407b... and 7371..., from source to sink.
Local server to replica server? Is it OK for the shd to heal files on
the replicated brick, which supposedly has no writing on it besides
the mirroring? How does that work?
In AFR there is no notion of a local/remote brick for writes. No
matter which client you write to the volume from, the write is sent to
both bricks, i.e. the replication is synchronous and real-time.
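You can see the source/sink accounting on disk by inspecting the AFR
changelog xattrs on each brick's copy of a file (a minimal sketch; the
file path under the brick is just an example):

# Dump the trusted.afr.* xattrs of a file's brick copy; a non-zero
# trusted.afr.gv1-client-N value records pending operations against
# brick N, i.e. the brick that needs healing (the sink)
getfattr -d -m . -e hex /mnt/brick1/gv1/images/vm-disk.qcow2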
How does AFR replication work? The file with gfid 7371... is the
qcow2 root disk of an ownCloud server with 17GB of data. It does not
seem big enough to be some sort of bottleneck, I think.
Also, I was investigating the directory tree in
brick/.glusterfs/indices and I noticed that in both xattrop and dirty
I always have a file named xattrop-xxxxxx or dirty-xxxxxx. I read that
the xattrop file is like a parent file, or handle, that other files
created there reference: they are hardlinks named by gfid, for the shd
to heal. Is it the same for the ones in the dirty dir?
Yes, before the write, the gfid gets captured inside dirty on all
bricks. If the write is successful, it gets removed. In addition, if
the write fails on one brick, the other brick will capture the gfid
inside xattrop.
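You can watch this on a brick (a sketch; the brick path is taken from
the brick log above): the xattrop-xxxxxx/dirty-xxxxxx file is the base
inode, and captured gfids show up as extra hardlinks to it, so the
link count in the first column tells you how many are currently
pending:

# gfid-named entries are hardlinks to the xattrop-/dirty- base file
ls -li /mnt/brick1/gv1/.glusterfs/indices/dirty/
ls -li /mnt/brick1/gv1/.glusterfs/indices/xattrop/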
Any help will be greatly appreciated. Thanks!
If frequent heals are triggered, it could mean there are frequent
network disconnects from the clients to the bricks while writes happen.
You can check the mount logs to see if that is the case, and
investigate possible network issues.
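To quantify how often heals are happening, something like this could
help (a minimal sketch, assuming the volume name gv1):

# Number of entries pending heal on each brick; run it a few times
# over several minutes to see whether the count keeps coming back
gluster volume heal gv1 statistics heal-count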
HTH,
Ravi
Pablo.
_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users