On 29/03/20 9:40 am, Erik Jacobson wrote:
Hello all,

I am getting split-brain errors in the gnfs nfs.log when 1 gluster
server is down in a 3-brick/3-node gluster volume. It only happens under
intense load.

In the lab, I have a test case that can repeat the problem on a single
subvolume cluster.

  If all leaders are up, we see no errors.


Here are example nfs.log errors:


[2020-03-29 03:42:52.295532] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. 
[Input/output error]

Since you say that the errors go away when all 3 bricks (which I guess is what you refer to as 'leaders') of the replica are up, it is possible that the brick you brought down held the only good copy. In that case, even though the other 2 bricks of the replica are up, they are both bad copies waiting to be healed, and all operations on those files will fail with EIO. Since you say this occurs only under high load, I suspect the heal simply hasn't had time to catch up with the nodes going up and down.
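If that is what is happening, you can check how far behind the heal is before taking the next node down. A quick sketch, using the volume name cm_shared from your log:

```shell
# List the files on each brick that still have pending heals.
gluster volume heal cm_shared info

# Per-brick count of entries still needing heal; wait for this
# to drop to 0 before bringing another node down.
gluster volume heal cm_shared statistics heal-count
```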

If you see the split-brain errors despite all 3 replica bricks being online and the gnfs server being able to connect to all of them, then it could be a genuine split-brain problem. But I don't think that is the case here.
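To rule out a genuine split-brain, you can ask gluster to list only the entries it considers truly split-brained (conflicting copies on the bricks), again assuming the volume name from your log:

```shell
# Entries in actual split-brain; an empty list means the EIO errors
# are from unhealed copies, not real split-brain.
gluster volume heal cm_shared info split-brain
```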

Regards,
Ravi

________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
