You had a split brain at one point.
RHEV adds an interesting dimension to this.
I have run into this before; it probably happened during an update to the
gluster servers or a sequential restart of the gluster processes or servers.

So, first thing: there is a nasty daily cron job, created by a package
included in the Red Hat base group, that runs a yum update every day. This is
one of the many reasons why my production kickstarts are always nobase installs.

The big reason this happens with RHEV is when a node is rebooted or the gluster
server processes are restarted, and another node in a 2-brick cluster has the
same thing happen too quickly afterwards. Essentially, while a self-heal
operation is in progress, the second node, which is the master source, goes
offline, and instead of fencing the volume the client fails over to the
incomplete copy. The result is actually a split brain, but the funny thing when
you add RHEV into the mix is that everything keeps working, so unless you are
using a tool like Splunk or a properly configured logwatch cron job on your
syslog server, you never know anything is wrong until you restart gluster on
one of the servers.
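If you suspect this has happened, gluster can report the files it considers split-brained directly; a quick check (using the volume name `gluster-rhev` from later in this thread):

```shell
# Files the self-heal daemon currently considers split-brained
gluster volume heal gluster-rhev info split-brain

# The general heal queue; entries here are pending, not necessarily failed
gluster volume heal gluster-rhev info
```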

So you did have a split brain you just didn't know it.
The easiest way to prevent this is to have a 3-replica brick structure on your
volumes and tighter controls over when reboots, process restarts, and updates
happen.
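Expanding an existing replica 2 volume to replica 3 can be done online by adding a third brick; a sketch, where the third server's hostname and brick path are hypothetical placeholders:

```shell
# Add a third replica to the existing 2-way replicated volume
# (server3 and the brick path are placeholders for your environment)
gluster volume add-brick gluster-rhev replica 3 server3:/export/brick1

# Then trigger a full heal so the new brick gets populated
gluster volume heal gluster-rhev full
```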
We have a replica 2 volume, where the second node was freshly added about a
week ago and, as far as I can tell, is fully replicated. This is storage
for a RHEV cluster and the total space currently in use is about 3.5TB.

When I run "gluster v heal gluster-rhev info heal-failed", it currently
lists 866 files on the original node and 1 file on the recently added node.
What I find most interesting is that the single file listed on the
second node is a lease file belonging to a VM template.

Some obvious questions come to mind: What is that output supposed to
mean? Does it in fact have a useful meaning at all? How can the
files be in a heal-failed condition and not also be in a split-brain
condition?
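One way to see what the self-heal daemon actually thinks of a given file is to inspect the AFR changelog extended attributes directly on each brick. A sketch; the brick path and filename are placeholders, and the exact `trusted.afr.*` attribute names depend on your volume's client names:

```shell
# Run on each gluster server, against the brick's backend path,
# not the mounted volume (paths here are hypothetical)
getfattr -d -m . -e hex /export/brick1/path/to/template.img

# Non-zero trusted.afr.<client>.data/metadata counters indicate pending
# changelog entries; non-zero counters on both bricks, each accusing the
# other, is the classic split-brain signature
```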

My interpretation of "heal-failed" is that the listed files are not yet
fully in sync across nodes (and are therefore, by definition, in a
split-brain condition), but that doesn't match the output of the command.
That can't be the gluster interpretation, though, because how could a
template file which has received no reads or writes possibly be in a
heal-failed condition a week after the initial volume heal?
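For what it's worth, my understanding is that the heal-failed output is a log of past failed heal attempts rather than a live state. One way to check whether those 866 entries are current problems or stale history is to kick off a fresh heal and then compare the live heal queue:

```shell
# Trigger a full self-heal crawl on the volume
gluster volume heal gluster-rhev full

# Files still pending heal right now (as opposed to historical failures)
gluster volume heal gluster-rhev info

# Files gluster currently considers split-brained
gluster volume heal gluster-rhev info split-brain
```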

_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
