Thanks, Digimer. This is an existing setup, so I'm stuck with these versions. For now, my workaround is to increase the dead time so that the peer status doesn't flap and trigger all these issues.
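
A minimal sketch of what such a change might look like in heartbeat's ha.cf (normally /etc/ha.d/ha.cf). The directive names are standard heartbeat ones, but every value below is an assumption chosen for illustration and is not taken from the actual setup:

```
# Illustrative values only (assumed, not from the real cluster).
# Seconds between heartbeat packets.
keepalive 1
# Log a "late heartbeat" warning after this many seconds of silence.
warntime 10
# Declare the peer dead after this many seconds of silence.
# The test setup used 6; anything longer than the longest expected
# packet-loss window (5~8 seconds in the test) avoids the flapping.
deadtime 30
# Dead time used right after startup; conventionally at least 2x deadtime.
initdead 60
```

The trade-off is that a longer deadtime also delays detection of a real node failure by the same amount, so this only masks the dropped heartbeats rather than fixing the recovery problem.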
Best,
-Kaiwei

----- Original Message -----
From: "Digimer" <[email protected]>
To: "General Linux-HA mailing list" <[email protected]>
Sent: Friday, June 20, 2014 4:19:29 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

On 20/06/14 03:18 PM, [email protected] wrote:
> Hi,
>
> New to this list and hope I can get some help here.
>
> I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I'm
> having a split-brain problem when heartbeat messages sometimes get dropped
> while the system is under high load. However, the problem is that it never
> recovers when the system load becomes low again.
>
> I created a test setup to reproduce this by setting dead time to 6 seconds,
> and continuously dropping one-way heartbeat packets (udp dst port 694) for
> 5~8 seconds and resuming the traffic for 1~2 seconds using iptables. After
> the system got into a split-brain state, I stopped the test and allowed all
> heartbeat traffic to go through. Sometimes the system recovered, but
> sometimes it didn't. There are various symptoms when the system didn't
> recover from split-brain:
>
> 1. In one instance, cl_status listnodes becomes empty. The syslog keeps
> showing
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.warning] [2853]: WARN: Message hist queue is filling up (436 messages in queue)
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->ackseq =12111
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->lowseq =12111, hist->hiseq=12547
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: expecting from node-1
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: it's ackseq=12111
>
> 2. In another instance, cl_status nodestatus <node> shows both nodes are
> active, but "crm_mon -1" shows that each of the two nodes thinks it is
> the DC and that the peer node is offline. The pengine process is running on
> one node only. The node not running pengine (but still thinking it is the
> DC) has a log showing that crmd terminated pengine because it detected the
> peer was active. After that, the peer status kept flapping between dead and
> active, but pengine has never been started again. The last log shows the
> peer is active (after I stopped the test and allowed all traffic). However,
> "crm_mon -1" shows that the node itself is the DC and the peer is offline:
>
> [root@node-1 ~]# crm_mon -1
> ============
> Last updated: Fri Jun 20 19:12:23 2014
> Stack: Heartbeat
> Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with quorum
> Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ node-1 ]
> OFFLINE: [ node-0 ]
>
> cluster (heartbeat:ha): Started node-1
>
>
> Any help, like a pointer to the source code where the problem might be, or
> any existing bug filed for this (I did some searching but didn't find
> matching symptoms), is appreciated.
>
> Thanks,
> -Kaiwei

Hi Kaiwei,

Is this a new install? If so, that is some very old (and deprecated) software. If it is an existing install, then you might find it hard to get an answer here (but by all means, you might).

Heartbeat hasn't been developed in a loooong time, and pacemaker 1.0.x is also very old. However, Linbit still offers commercial support for heartbeat. So if you don't get help here, you might want to drop them a line.

Cheers, and best of luck.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
