Oh, great!

In the meantime, I would _strongly_ recommend you start work on a migration plan. There is no danger in the foreseeable future as Linbit has no plans to end support, but the stack itself is quite deprecated.

The stack that all major distros have now settled on is corosync + pacemaker, so this is what I would recommend moving towards. You'll probably find that the shift isn't very difficult, as pacemaker was born out of the heartbeat project.

Cheers

On 21/06/14 01:47 AM, f...@vmware.com wrote:
Thanks, Digimer. This is an existing setup so I'm stuck with them. Currently my 
workaround is to increase the dead time so it won't flap and cause all these 
issues.

Best,
-Kaiwei

----- Original Message -----
From: "Digimer" <li...@alteeve.ca>
To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
Sent: Friday, June 20, 2014 4:19:29 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

On 20/06/14 03:18 PM, f...@vmware.com wrote:
Hi,

New to this list and hope I can get some help here.

I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I'm
running into a split-brain problem: heartbeat messages sometimes get dropped
when the system is under high load. The real problem, however, is that the
cluster never recovers once the system load drops again.

I created a test setup for this by setting the dead time to 6 seconds and
using iptables to repeatedly drop one-way heartbeat packets (udp dst port 694)
for 5~8 seconds, then let the traffic resume for 1~2 seconds. After the system
got into a split-brain state, I stopped the test and allowed all heartbeat
traffic to go through. Sometimes the system recovered, but sometimes it didn't.
There were various symptoms when the system didn't recover from split-brain:

1. In one instance, cl_status listnodes becomes empty. The syslog keeps showing:
2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.warning] [2853]: WARN: Message hist queue is filling up (436 messages in queue)
2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: hist->ackseq =12111
2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: hist->lowseq =12111, hist->hiseq=12547
2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: expecting from node-1
2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: it's ackseq=12111

2. In another instance, cl_status nodestatus <node> shows both nodes as active,
but "crm_mon -1" shows that each of the two nodes thinks it is the DC and that
the peer node is offline. The pengine process is running on one node only. On
the node not running pengine (which still thinks it is the DC), the log shows
that crmd terminated pengine because it detected that the peer was active.
After that, the peer status kept flapping between dead and active, but pengine
was never started again. The last log entry shows the peer as active (after I
stopped the test and allowed all traffic). However, "crm_mon -1" on that node
still shows itself as the DC and the peer as offline:

[root@node-1 ~]# crm_mon -1
============
Last updated: Fri Jun 20 19:12:23 2014
Stack: Heartbeat
Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with quorum
Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ node-1 ]
OFFLINE: [ node-0 ]

   cluster     (heartbeat:ha):      Started node-1
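
For anyone who wants to reproduce the packet-drop test described above, here is
a minimal Python sketch of it. It is not the script that was actually used: the
INPUT chain, the function names (drop_rule, make_schedule, run) and the
dry-run default are assumptions made for illustration. Only the UDP destination
port 694 and the 5~8 second drop / 1~2 second pass windows come from the
description. By default it only collects the iptables commands it would issue.

```python
import random
import subprocess
import time

HEARTBEAT_PORT = 694  # UDP destination port from the test description


def drop_rule(action, port=HEARTBEAT_PORT):
    """Build the iptables argv that appends ("-A") or deletes ("-D") a DROP rule.

    Assumption: the rule sits in the INPUT chain of one node, so only the
    heartbeat packets arriving at that node are dropped (one-way loss).
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return ["iptables", action, "INPUT", "-p", "udp",
            "--dport", str(port), "-j", "DROP"]


def make_schedule(cycles, rng, drop_range=(5.0, 8.0), pass_range=(1.0, 2.0)):
    """Return a list of (drop_seconds, pass_seconds) pairs, one per cycle."""
    return [(rng.uniform(*drop_range), rng.uniform(*pass_range))
            for _ in range(cycles)]


def run(schedule, execute=False, sleep=time.sleep):
    """Walk the schedule and return every command in the order it is issued.

    With execute=False (the default) nothing is run and nothing sleeps.
    With execute=True the commands are run for real, which needs root; if a
    command fails part-way, the DROP rule may be left in place and has to be
    removed by hand.
    """
    issued = []
    for drop_s, pass_s in schedule:
        for action, wait in (("-A", drop_s), ("-D", pass_s)):
            cmd = drop_rule(action)
            issued.append(cmd)
            if execute:
                subprocess.run(cmd, check=True)
                sleep(wait)
    return issued


if __name__ == "__main__":
    plan = make_schedule(cycles=3, rng=random.Random(0))
    for cmd in run(plan):
        print(" ".join(cmd))
```

Running it as-is only prints the commands. Calling run(plan, execute=True) as
root on one node would apply them, with the dead time set to 6 seconds as in
the test above.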


Any help, such as a pointer to the source code where the problem might be, or
to an existing bug filed for this (I did some searching but didn't find
matching symptoms), is appreciated.

Thanks,
-Kaiwei

Hi Kaiwei,

    Is this a new install? If so, that is some very old (and deprecated)
software. If it is an existing install, then you might find it hard to
get an answer here (but by all means, you might). Heartbeat hasn't been
developed in a loooong time, and pacemaker 1.0.x is also very old.
However, Linbit still offers commercial support for heartbeat. So if you
don't get help here, you might want to drop them a line.

Cheers, and best of luck.



--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
