On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
> 
> On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:
> 
> > Hi,
> > 
> > I understand that initially the split-brain is caused by the heartbeat 
> > messaging layer and that not much can be done when packets are 
> > dropped. However, the problem is that sometimes, when the load is gone (or 
> > when iptables allows all traffic in my test setup), it doesn't recover.
> > 
> > In the second case I provided, heartbeat on both nodes did find the 
> > other node and both were active, but Pacemaker on both nodes still thinks 
> > the peer is offline. I don't know if this is heartbeat's problem or 
> > Pacemaker's problem though.
> 
> Do you see any messages from 'crmd' saying the node left/returned?
> If you only see the node going away, then it's almost certainly a heartbeat 
> problem.
> 
> You may have better luck with a corosync based cluster, or even a newer 
> version of pacemaker (or both! the 1.0.x codebase is quite old at this point).
> 
> I was never all that happy with heartbeat's membership code, it was a 
> near-abandoned mystery box even at the point I started Pacemaker 10 years ago.
> Corosync membership had its problems in the beginning, but personally I take 
> comfort in the fact that it's actively being worked on.
> Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years 
> ago.

Possibly.  But especially with nodes
"unexpectedly returning after having been declared dead",
I've still seen more problems with corosync than with heartbeat,
even within the last few years.

Anyways:
Andrew is right, you should use (recent!) corosync and recent pacemaker.
And working node level fencing aka stonith.

That said, you said earlier you are using heartbeat 3.0.5,
and that heartbeat successfully re-established membership.
So you can confirm "ccm_testclient" on both nodes reports
the expected and same membership?
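
If you want to make that comparison less error-prone, here is a minimal
sketch. It assumes you have copied the member list each node reports
(e.g. from "ccm_testclient") by hand; it does not parse any tool output,
and the function name and the node names are made up for illustration.

```python
def compare_membership(views, expected):
    """Compare the membership each node reports against the expected set.

    views:    dict mapping a reporting node to the member names it sees
              (entered by hand; hypothetical names in the example below)
    expected: the node names that should be members
    Returns a list of problem descriptions; empty means all views agree.
    """
    expected = frozenset(expected)
    normalized = {node: frozenset(members) for node, members in views.items()}
    problems = []
    for node, members in sorted(normalized.items()):
        missing = expected - members
        extra = members - expected
        if missing:
            problems.append(f"{node}: missing {sorted(missing)}")
        if extra:
            problems.append(f"{node}: unexpected {sorted(extra)}")
    # Even if no view has extras, differing views mean split membership.
    if len(set(normalized.values())) > 1:
        problems.append("nodes disagree about membership")
    return problems


if __name__ == "__main__":
    # Hypothetical two-node cluster where node1 does not see node2.
    views = {"node1": ["node1"], "node2": ["node1", "node2"]}
    for line in compare_membership(views, ["node1", "node2"]):
        print(line)
```

An empty result would mean both nodes report the expected and same
membership, which would point away from heartbeat's membership layer
and towards Pacemaker.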

Is that the 3.0.5 release tag, or a more "recent" hg checkout?
You need heartbeat up to at least this commit:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6

(I have meant to add a 3.0.6 release tag ever since I pushed that commit,
but because of packaging inconsistencies I want to fix,
and other commitments, I have deferred that for much too long.)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
