On 25 Jun 2014, at 12:03 am, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:

> On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
>> 
>> On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:
>> 
>>> Hi,
>>> 
>>> I understand that the split-brain is initially caused by the heartbeat 
>>> messaging layer, and that not much can be done when packets are 
>>> dropped. However, the problem is that sometimes, when the load is gone (or 
>>> when iptables allows all traffic in my test setup), it doesn't recover.
>>> 
>>> In the second case I provided, heartbeat on both nodes did find its 
>>> peer and both were active, but pacemaker on both nodes still thinks the 
>>> peer is offline. I don't know whether this is heartbeat's problem or 
>>> Pacemaker's problem, though.
>> 
>> Do you see any messages from 'crmd' saying the node left/returned?
>> If you only see the node going away, then it's almost certainly a heartbeat 
>> problem.
>> 
>> You may have better luck with a corosync based cluster, or even a newer 
>> version of pacemaker (or both! the 1.0.x codebase is quite old at this 
>> point).
>> 
>> I was never all that happy with heartbeat's membership code; it was a 
>> near-abandoned mystery box even at the point I started Pacemaker 10 years 
>> ago.
>> Corosync membership had its problems in the beginning, but personally I take 
>> comfort in the fact that it's actively being worked on.
>> Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years 
>> ago.
> 
> Possibly.  But especially with nodes
> "unexpectedly returning after having been declared dead",
> I've still seen more problems with corosync than with heartbeat,
> even within the last few years.

Unfortunately a fair share of those have also been pacemaker bugs :(
Yan is working on another one related to slow fencing devices.

> 
> Anyways:
> Andrew is right, you should use (recent!) corosync and recent pacemaker.
> And working node level fencing aka stonith.
> 
> That said, you said earlier you are using heartbeat 3.0.5,
> and that heartbeat successfully re-established membership.
> So you can confirm "ccm_testclient" on both nodes reports
> the expected and same membership?
> 
> Is that 3.0.5 release tag, or a more "recent" hg checkout?
> You need heartbeat up to at least this commit:
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6
> 
> (I have meant to add a 3.0.6 release tag ever since I pushed that commit,
> but because of packaging inconsistencies I want to fix,
> and other commitments, I have deferred that for much too long).
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
