On 29/03/20 9:40 am, Erik Jacobson wrote:
Hello all,

I am getting split-brain errors in the gnfs nfs.log when 1 gluster
server is down in a 3-brick/3-node gluster volume. It only happens under
intense load.

In the lab, I have a test case that can repeat the problem on a single
subvolume cluster.

  If all leaders are up, we see no errors.


Here are example nfs.log errors:


[2020-03-29 03:42:52.295532] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. 
[Input/output error]

Since you say that the errors go away when all 3 bricks (which I guess is what you refer to as 'leaders') of the replica are up, it is possible that the brick you brought down held the only good copy. In that case, even though the other 2 bricks of the replica are up, they are both bad copies waiting to be healed, and all operations on those files will fail with EIO. Since you say this occurs only under high load, I suspect the heal simply hasn't had time to catch up with the nodes going up and down.
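If that is what is happening, you can check how far behind the heal is before taking the next node down. A quick sketch, using the volume name cm_shared from your log:

```shell
# List the files on each brick that still have pending heals.
gluster volume heal cm_shared info

# Per-brick count of entries still needing heal; wait for this
# to drop to 0 before bringing another node down.
gluster volume heal cm_shared statistics heal-count
```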

If you see the split-brain errors despite all 3 replica bricks being online and the gnfs server being able to connect to all of them, then it could be a genuine split-brain problem. But I don't think that is the case here.
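To rule out a genuine split-brain, you can ask gluster to list only the entries it considers truly split-brained (conflicting copies on the bricks), again assuming the volume name from your log:

```shell
# Entries in actual split-brain; an empty list means the EIO errors
# are from unhealed copies, not real split-brain.
gluster volume heal cm_shared info split-brain
```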

Regards,
Ravi

________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
