Re: [Linux-HA] unable to recover from split-brain in a two-node cluster
On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
> On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:
>> Hi, I understand that initially the split-brain is caused by the heartbeat messaging layer, and there is not much that can be done when packets are dropped. However, the problem is that sometimes, when the load is gone (or when iptables allows all traffic in my test setup), it doesn't recover. In the second case I provided, heartbeat on both nodes did find each other and both were active, but pacemaker on both nodes still thinks the peer is offline. I don't know whether this is heartbeat's problem or Pacemaker's problem, though.
>
> Do you see any messages from 'crmd' saying the node left/returned? If you only see the node going away, then it's almost certainly a heartbeat problem. You may have better luck with a corosync-based cluster, or even a newer version of pacemaker (or both! The 1.0.x codebase is quite old at this point). I was never all that happy with heartbeat's membership code; it was a near-abandoned mystery box even at the point I started Pacemaker 10 years ago. Corosync membership had its problems in the beginning, but personally I take comfort in the fact that it is actively being worked on. Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years ago.

Possibly. But especially with nodes unexpectedly returning after having been declared dead, I've still seen more problems with corosync than with heartbeat, even within the last few years.

Anyways: Andrew is right, you should use (recent!) corosync and recent pacemaker. And working node-level fencing, aka stonith.

That said, you said earlier that you are using heartbeat 3.0.5, and that heartbeat successfully re-established membership. So can you confirm that ccm_testclient on both nodes reports the expected, identical membership? Is that the 3.0.5 release tag, or a more recent hg checkout?
You need heartbeat up to at least this commit:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6
(I have meant to add a 3.0.6 release tag ever since I pushed that commit, but because of packaging inconsistencies I want to fix, and other commitments, I deferred that much too long.)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
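For what it's worth, the membership comparison Lars asks about can be scripted along these lines. This is only a sketch: the node names are the ones used in this thread, and root ssh access between the nodes is an assumption.

```shell
#!/bin/sh
# Compare heartbeat's view of the two-node membership from both sides.
# Node names follow the thread; root ssh between the nodes is assumed.
NODES="node-0 node-1"

check_membership() {
    for host in $NODES; do
        echo "=== view from $host ==="
        # Which nodes does heartbeat on $host know about?
        ssh "root@$host" cl_status listnodes
        # And what does it think their status is?
        for peer in $NODES; do
            printf '%s: ' "$peer"
            ssh "root@$host" cl_status nodestatus "$peer"
        done
    done
}

# Touches real cluster nodes: only run when explicitly asked to.
[ "${1:-}" = "run" ] && check_membership
```

If both views list both nodes as active but pacemaker still shows the peer offline, the membership layer and the CRM disagree, which narrows down where to look.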
Re: [Linux-HA] unable to recover from split-brain in a two-node cluster
Hi Andrew,

I do see the following last status update from crmd on node-1, but crm_mon -1 still shows node-0 offline:

crmd_ha_status_callback: Status update: Node node-0 now has status [active] [DC=false]

Same on node-0: it shows node-1 now has status active, but crm_mon -1 shows it offline.

Thanks,
-Kaiwei

----- Original Message -----
From: Andrew Beekhof and...@beekhof.net
To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
Sent: Monday, June 23, 2014 7:23:30 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

[...]
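The mismatch described here (crmd's status callback says "active" while crm_mon still shows the node offline) can be captured side by side with a small sketch like the following. The peer name and syslog path are assumptions based on the logs in this thread.

```shell
#!/bin/sh
# Show heartbeat/crmd's view of the peer next to pacemaker's view.
PEER="node-0"                 # peer node name, as used in this thread
SYSLOG="/var/log/messages"    # assumption: where crmd logs end up

compare_views() {
    echo "heartbeat view of $PEER:"
    cl_status nodestatus "$PEER"            # split-brain symptom: "active"
    echo "recent crmd status callbacks:"
    grep crmd_ha_status_callback "$SYSLOG" | tail -n 3
    echo "pacemaker (crm_mon) view:"
    crm_mon -1 | grep -iE 'online|offline'  # symptom: peer still OFFLINE
}

# Only run on a real cluster node when explicitly asked to.
[ "${1:-}" = "run" ] && compare_views
```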
----- Original Message -----
From: Andrew Beekhof and...@beekhof.net
To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
Sent: Sunday, June 22, 2014 3:45:00 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

On 21 Jun 2014, at 5:18 am, f...@vmware.com wrote:
> Hi, new to this list; I hope I can get some help here. I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I'm having a split-brain problem when heartbeat messages sometimes get dropped while the system is under high load. The problem, however, is that it never recovers once the system load becomes low again. I created a test setup for this by setting the dead time to 6 seconds, then using iptables to continuously drop one-way heartbeat packets (UDP dst port 694) for 5~8 seconds and resume the traffic for 1~2 seconds. After the system got into the split-brain state, I stopped the test and allowed all heartbeat traffic through. Sometimes the system recovered, but sometimes it didn't. There are various symptoms when the system didn't recover from split-brain:
>
> 1. In one instance, cl_status listnodes becomes empty. The syslog keeps showing:
>
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.warning] [2853]: WARN: Message hist queue is filling up (436 messages in queue)
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->ackseq = 12111
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->lowseq = 12111, hist->hiseq = 12547
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: expecting from node-1
> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: it's ackseq = 12111
>
> 2. In another instance, cl_status nodestatus shows both nodes are active, but crm_mon -1 shows that each of the two nodes thinks itself the DC and the peer node offline. The pengine process is running on one node only.
> The node not running pengine (but still thinking itself the DC) has logs showing that crmd terminated pengine because it detected that the peer is active. After that, the peer status keeps flapping between dead and active, but pengine has never been started again. The last log shows the peer as active (after I stopped the test and allowed all traffic). However, crm_mon -1 shows itself as the DC and the peer as offline:
>
> [root@node-1 ~]# crm_mon -1
> Last updated: Fri Jun 20 19:12:23 2014
> Stack: Heartbeat
> Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with quorum
> Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
>
> Online: [ node-1 ]
> OFFLINE: [ node-0 ]
>
> cluster (heartbeat:ha): Started node-1
>
> Any help, like a pointer to the source code where the problem might be, or any existing bug filed for this (I did some searching but didn't find matching symptoms), is appreciated.

This is happening at the heartbeat level. Not much pacemaker can do, I'm afraid. Perhaps look to see if heartbeat is real time.
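For reference, the flapping test described above can be scripted roughly as follows. This is a reconstruction of the reported setup, not the reporter's actual script; it uses heartbeat's UDP port 694 as stated in the report and must run as root.

```shell
#!/bin/sh
# Reproduce the reported test: with deadtime at 6s, drop inbound heartbeat
# packets (UDP dst port 694) for 5-8s, let them through for 1-2s, repeat.
# The 5-8s outages straddle the 6s deadtime, so the peer is intermittently
# declared dead, driving the cluster into split-brain.

rand_between() {  # rand_between MIN MAX -> random integer in [MIN, MAX]
    lo=$1; hi=$2
    echo $(( lo + $(od -An -N1 -tu1 /dev/urandom) % (hi - lo + 1) ))
}

flap_heartbeat() {
    # Remove the DROP rule again if we are interrupted.
    trap 'iptables -D INPUT -p udp --dport 694 -j DROP 2>/dev/null; exit' INT TERM
    while :; do
        iptables -A INPUT -p udp --dport 694 -j DROP   # start dropping
        sleep "$(rand_between 5 8)"
        iptables -D INPUT -p udp --dport 694 -j DROP   # resume traffic
        sleep "$(rand_between 1 2)"
    done
}

# Destructive; only run when explicitly requested.
[ "${1:-}" = "run" ] && flap_heartbeat
```

Stopping the script (so the DROP rule is removed) corresponds to the "stopped the test and allowed all heartbeat traffic through" step in the report.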
Re: [Linux-HA] unable to recover from split-brain in a two-node cluster
Hi Lars,

Thanks for pointing out the patch. It is not in the heartbeat version on the system (it is using Heartbeat-3-0-7e3a82377fa8). I'll try that out.

As for ccm_testclient: the system has stripped out files that won't be used during normal operation, including gcc. So ccm_testclient complains that gcc is not found, and I cannot test it on that system. cl_status listnodes shows both nodes on both systems, and cl_status nodestatus shows both as active, though.

Thanks,
-Kaiwei

----- Original Message -----
From: Lars Ellenberg lars.ellenb...@linbit.com
To: linux-ha@lists.linux-ha.org
Sent: Tuesday, June 24, 2014 7:03:47 AM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

[...]
Re: [Linux-HA] unable to recover from split-brain in a two-node cluster
On 25 Jun 2014, at 12:03 am, Lars Ellenberg lars.ellenb...@linbit.com wrote:
> On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
> [...]
>
> Possibly. But especially with nodes unexpectedly returning after having been declared dead, I've still seen more problems with corosync than with heartbeat, even within the last few years.

Unfortunately a fair share of those have also been pacemaker bugs :(
Yan is working on another one related to slow fencing devices.

> Anyways: Andrew is right, you should use (recent!) corosync and recent pacemaker. And working node level fencing aka stonith.
>
> That said, you said earlier you are using heartbeat 3.0.5, and that heartbeat successfully re-established membership.
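For anyone taking the "move to corosync" advice from this thread: a minimal corosync 2.x configuration for a two-node cluster looks roughly like the sketch below. The cluster name, bind network, and addresses are placeholders (node names follow the thread), and, as Lars says, it still needs working stonith alongside it to be safe.

```conf
# /etc/corosync/corosync.conf -- minimal two-node sketch (corosync 2.x)
totem {
        version: 2
        cluster_name: ha-test          # placeholder
        transport: udpu                # unicast; avoids multicast issues
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0   # placeholder: cluster network
        }
}

nodelist {
        node {
                ring0_addr: node-0
        }
        node {
                ring0_addr: node-1
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 1    # two-node mode: keep quorum with a single node up
}

logging {
        to_syslog: yes
}
```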