Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

2014-06-24 Thread Lars Ellenberg
On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
 
 On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:
 
  Hi,
  
  I understand that initially the split-brain is caused by the heartbeat 
  messaging layer, and there is not much that can be done when packets are 
  dropped. However, the problem is that sometimes, when the load is gone (or when 
  iptables allows all traffic in my test setup), it doesn't recover.
  
  In the second case I provided, heartbeat on both nodes did find each 
  other and both were active, but Pacemaker on both nodes still thinks the peer 
  is offline. I don't know whether this is heartbeat's problem or Pacemaker's 
  problem, though.
 
 Do you see any messages from 'crmd' saying the node left/returned?
 If you only see the node going away, then it's almost certainly a heartbeat 
 problem.
 
 You may have better luck with a corosync-based cluster, or even a newer 
 version of pacemaker (or both! the 1.0.x codebase is quite old at this point).
 
 I was never all that happy with heartbeat's membership code; it was a 
 near-abandoned mystery box even when I started Pacemaker 10 years ago.
 Corosync membership had its problems in the beginning, but personally I take 
 comfort in the fact that it's actively being worked on.
 Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years 
 ago.

Possibly.  But especially with nodes
unexpectedly returning after having been declared dead,
I've still seen more problems with corosync than with heartbeat,
even within the last few years.

Anyway:
Andrew is right: you should use (recent!) corosync and recent pacemaker,
and working node-level fencing, aka STONITH.
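
For illustration only, a minimal sketch of what node-level fencing could
look like with the crm shell. The external/ipmi agent and every parameter
below are placeholders for this example, not taken from this thread;
substitute whatever fencing hardware the cluster actually has:

    # Placeholder fencing device for node-0 (add a matching one for node-1).
    # Agent, address, and credentials are assumptions for illustration.
    crm configure primitive fence-node-0 stonith:external/ipmi \
        params hostname=node-0 ipaddr=192.0.2.10 userid=admin passwd=secret \
        op monitor interval=60s
    # Tell Pacemaker that it is allowed to actually fence nodes.
    crm configure property stonith-enabled=true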

That said: you said earlier that you are using heartbeat 3.0.5,
and that heartbeat successfully re-established membership.
Can you confirm that ccm_testclient on both nodes reports
the same, expected membership?

Is that the 3.0.5 release tag, or a more recent hg checkout?
You need heartbeat up to at least this commit:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6
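
(If you build from an hg checkout, a quick sketch of how to check whether
that commit is present, assuming a working copy of heartbeat-STABLE_3_0:

    # hg log -r exits non-zero if the revision is unknown to this clone.
    hg log -r fd1b907a0de6 >/dev/null 2>&1 && echo "commit present" || echo "commit missing"
)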

(I have meant to add a 3.0.6 release tag ever since I pushed that commit,
but because of packaging inconsistencies I want to fix,
and other commitments, I have deferred that much too long.)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

2014-06-24 Thread fank
Hi Andrew,

I do see the following last status update from crmd on node-1, but crm_mon -1 
still shows node-0 offline:
crmd_ha_status_callback: Status update: Node node-0 now has status [active] 
[DC=false]
The same happens on node-0: crmd reports that node-1 now has status [active], 
but crm_mon -1 shows it offline.

Thanks,
-Kaiwei

- Original Message -
From: Andrew Beekhof and...@beekhof.net
To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
Sent: Monday, June 23, 2014 7:23:30 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node
cluster


On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:

 [...]

Do you see any messages from 'crmd' saying the node left/returned?
If you only see the node going away, then it's almost certainly a heartbeat 
problem.

[...]

 
 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Sent: Sunday, June 22, 2014 3:45:00 PM
 Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node  
 cluster
 
 
 On 21 Jun 2014, at 5:18 am, f...@vmware.com wrote:
 
 Hi,
 
 New to this list and hope I can get some help here.
 
 I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I'm 
 having a split-brain problem when heartbeat messages sometimes get dropped 
 while the system is under high load. However, the problem is that it never 
 recovers when the system load becomes low again.
 
 I created a test setup to reproduce this: I set the dead time to 6 seconds, 
 then used iptables to continuously drop one-way heartbeat packets (udp dst 
 port 694) for 5~8 seconds and resume the traffic for 1~2 seconds. After the 
 system got into the split-brain state, I stopped the test and allowed all 
 heartbeat traffic to go through. Sometimes the system recovered, but 
 sometimes it didn't. There are various symptoms when the system didn't 
 recover from split-brain:
 
 1. In one instance, cl_status listnodes becomes empty. The syslog keeps 
 showing
 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.warning] [2853]: WARN: 
 Message hist queue is filling up (436 messages in queue)
 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
 hist->ackseq =12111
 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
 hist->lowseq =12111, hist->hiseq=12547
 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
 expecting from node-1
 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: 
 it's ackseq=12111
 
 2. In another instance, cl_status nodestatus node shows both nodes are 
 active, but crm_mon -1 shows that each of the two nodes thinks it is 
 the DC and that the peer node is offline. The pengine process is running on 
 one node only. The node not running pengine (but still thinking it is the 
 DC) has a log showing that crmd terminated pengine because it detected that 
 the peer is active. After that, the peer status kept flapping between dead 
 and active, but pengine was never started again. The last log entry shows 
 the peer as active (after I stopped the test and allowed all traffic). 
 However, crm_mon -1 shows the node itself as the DC and the peer as offline:
 
 [root@node-1 ~]# crm_mon -1
 
 Last updated: Fri Jun 20 19:12:23 2014
 Stack: Heartbeat
 Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with 
 quorum
 Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
 2 Nodes configured, unknown expected votes
 1 Resources configured.
 
 
 Online: [ node-1 ]
 OFFLINE: [ node-0 ]
 
 cluster (heartbeat:ha):  Started node-1
 
 
 Any help is appreciated, such as a pointer to the place in the source code 
 where the problem might be, or an existing bug filed for this (I did some 
 searching but didn't find matching symptoms).
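 
 For reference, the drop/resume cycle described above can be scripted roughly 
 like this (a sketch only; the INPUT chain and exact timings are assumptions 
 based on the description, with ha.cf deadtime 6 and udpport 694):
 
     # Flap one-way heartbeat traffic: drop UDP port 694 inbound for 5-8 s,
     # then let it through for 1-2 s, repeatedly. Run on one node only.
     while true; do
         iptables -I INPUT -p udp --dport 694 -j DROP
         sleep $(( RANDOM % 4 + 5 ))   # drop for 5-8 seconds
         iptables -D INPUT -p udp --dport 694 -j DROP
         sleep $(( RANDOM % 2 + 1 ))   # allow for 1-2 seconds
     done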
 
 This is happening at the heartbeat level.
 
 Not much pacemaker can do I'm afraid.  Perhaps look to see if heartbeat is 
 real time 

Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

2014-06-24 Thread fank
Hi Lars,

Thanks for pointing out the patch. It is not in the heartbeat version on this 
system (which is using Heartbeat-3-0-7e3a82377fa8). I'll try it out.

As for ccm_testclient: the system has stripped out files that won't be used 
during normal operation, including gcc, so ccm_testclient complains that gcc is 
not found and I cannot run it on that system. cl_status listnodes shows both 
nodes on both systems, and cl_status nodestatus shows both as active, though.
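
(For scripting that check, a small sketch using only the cl_status commands 
quoted above; run it on each node and compare the output:

    # Print each known cluster node and its status as this node sees it.
    for n in $(cl_status listnodes); do
        printf '%s: %s\n' "$n" "$(cl_status nodestatus "$n")"
    done
)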

Thanks,
-Kaiwei

- Original Message -
From: Lars Ellenberg lars.ellenb...@linbit.com
To: linux-ha@lists.linux-ha.org
Sent: Tuesday, June 24, 2014 7:03:47 AM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

[...]

So you can confirm ccm_testclient on both nodes reports
the expected and same membership?

Is that the 3.0.5 release tag, or a more recent hg checkout?
You need heartbeat up to at least this commit:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6

[...]

Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

2014-06-24 Thread Andrew Beekhof

On 25 Jun 2014, at 12:03 am, Lars Ellenberg lars.ellenb...@linbit.com wrote:

 On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
 
 [...]
 
 Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years 
 ago.
 
 Possibly.  But especially with nodes
 unexpectedly returning after having been declared dead,
 I've still seen more problems with corosync than with heartbeat,
 even within the last few years.

Unfortunately a fair share of those have also been pacemaker bugs :(
Yan is working on another one related to slow fencing devices.

 
 [...]


