Hi,

I understand that the split-brain is initially caused by the heartbeat 
messaging layer, and that not much can be done while packets are being 
dropped. However, the problem is that the cluster sometimes fails to recover 
even after the load is gone (or after iptables allows all traffic again in my 
test setup).

In the second case I described, heartbeat on both nodes did find the other 
node and reported it as active, yet pacemaker on both nodes still thinks the 
peer is offline. I don't know whether this is heartbeat's problem or 
Pacemaker's, though.
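For reference, this is roughly how I compare the two layers' views on each 
node (a sketch using the commands from my earlier mail; the node names are 
from my setup):

    # Heartbeat layer: membership as the messaging layer sees it
    cl_status listnodes
    cl_status nodestatus node-0
    cl_status nodestatus node-1

    # Pacemaker layer: membership as the CRM sees it
    crm_mon -1

When the cluster is stuck, cl_status reports both nodes as active while 
crm_mon still lists the peer under OFFLINE.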

Thanks,
-Kaiwei

----- Original Message -----
From: "Andrew Beekhof" <[email protected]>
To: "General Linux-HA mailing list" <[email protected]>
Sent: Sunday, June 22, 2014 3:45:00 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster


On 21 Jun 2014, at 5:18 am, [email protected] wrote:

> Hi,
> 
> New to this list and hope I can get some help here.
> 
> I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I hit 
> a split-brain problem when heartbeat messages sometimes get dropped while 
> the system is under high load. The problem is that the cluster never 
> recovers once the system load drops again.
> 
> I created a test setup for this by setting the dead time to 6 seconds, then 
> continuously dropping one-way heartbeat packets (udp dst port 694) for 5~8 
> seconds and resuming the traffic for 1~2 seconds using iptables (the 
> drop/resume loop is sketched after the symptom list below). After the system 
> got into the split-brain state, I stopped the test and allowed all heartbeat 
> traffic through. Sometimes the system recovered, but sometimes it didn't. 
> These are the symptoms I saw when it didn't recover from split-brain:
> 
> 1. In one instance, the output of cl_status listnodes becomes empty, and 
> the syslog keeps showing
> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.warning] [2853]: WARN: 
> Message hist queue is filling up (436 messages in queue)
> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
> hist->ackseq =12111
> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
> hist->lowseq =12111, hist->hiseq=12547
> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
> expecting from node-1
> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
> it's ackseq=12111
> 
> 2. In another instance, cl_status nodestatus <node> shows both nodes as 
> active, but "crm_mon -1" shows that each of the two nodes considers itself 
> the DC and the peer node offline. The pengine process is running on one 
> node only. The node not running pengine (but still considering itself the 
> DC) has a log showing that crmd terminated pengine because it detected the 
> peer was active. After that, the peer status keeps flapping between dead 
> and active, but pengine is never started again. The last log entry shows 
> the peer as active (after I stopped the test and allowed all traffic), yet 
> "crm_mon -1" still shows this node as the DC and the peer as offline:
> 
> [root@node-1 ~]# crm_mon -1
> ============
> Last updated: Fri Jun 20 19:12:23 2014
> Stack: Heartbeat
> Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with 
> quorum
> Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
> 
> Online: [ node-1 ]
> OFFLINE: [ node-0 ]
> 
> cluster     (heartbeat:ha):      Started node-1
> 
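> For completeness, the drop/resume loop I used is essentially the following 
> (a sketch; the 5~8 and 1~2 second windows match the test described above, 
> and the random jitter is only illustrative):
> 
>     # Run under bash as root. Block one-way heartbeat traffic
>     # (UDP dst port 694) for 5-8 seconds, then let it through
>     # again for 1-2 seconds, in a loop.
>     while true; do
>         iptables -A INPUT -p udp --dport 694 -j DROP
>         sleep $((5 + RANDOM % 4))    # blocked for 5-8 seconds
>         iptables -D INPUT -p udp --dport 694 -j DROP
>         sleep $((1 + RANDOM % 2))    # open for 1-2 seconds
>     done
> 
> "Stopping the test" means killing this loop and removing the rule with 
> "iptables -D INPUT -p udp --dport 694 -j DROP" if it is still present.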
> 
> Any help, such as a pointer to the place in the source code where the 
> problem might be, or any existing bug filed for this (I did some searching 
> but didn't find matching symptoms), is appreciated.

This is happening at the heartbeat level.

Not much pacemaker can do, I'm afraid. Perhaps check whether heartbeat is 
scheduled with a real-time policy; if it isn't, that may explain why it's 
being starved of CPU and unable to get its messages out.
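
One quick way to check (a sketch; the relevant process names may differ 
between builds):

    # Scheduling class and real-time priority of the heartbeat processes.
    # cls: TS = normal time-sharing, FF/RR = real-time FIFO/round-robin.
    ps -C heartbeat -o pid,cls,rtprio,comm

    # Or query a single process with chrt:
    chrt -p $(pidof -s heartbeat)

If these report TS / SCHED_OTHER rather than FF / SCHED_FIFO, the heartbeat 
daemon is competing with everything else for CPU under load.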

> 
> Thanks,
> -Kaiwei


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems