On Fri, Oct 02, 2009 at 10:21:53AM +0200, James Brackinshaw wrote:
> On Fri, Oct 2, 2009 at 10:17 AM, Lars Ellenberg
> <[email protected]> wrote:
> > On Wed, Sep 30, 2009 at 02:22:32PM +0200, James Brackinshaw wrote:
> >> Hello,
> >>
> >> I have a two node heartbeat setup on Centos 5.3.
> >>
> >> The two nodes are in separate locations and connected only via
> >> ethernet. Because of this we require that a human guarantee that a
> >> node is dead before a switchover occurs. We use meatclient for this.
> >>
> >> Automatic failback is turned off. We would like the primary node to do
> >> all of the work unless we manually switch roles or the primary node
> >> dies.
> >>
> >> We recently had a network outage. We expected that the primary node
> >> would stay active and keep providing services. Instead, the two nodes
> >> switched roles while the network was being repaired.
> >>
> >> I cannot understand how the role switching happened since we ran no
> >> scripts manually (at least not at the start), and did not run
> >> meatclient.
> >>
> >> Can anyone help me understand why this happened?
> >
> > Connectivity came back.
> >
> >> I attach my log files.
> >
> > I did not have a look.
> >
> > But you are likely to find
> >  WARN node whatever-xy: is dead
> >  ...
> >  CRIT: Cluster node whatever-xy returning after partition.
> >  WARN: Deadtime value may be too small
> > ...
> >
> > this is handled by the cluster software by stopping all resources,
> > then starting on the "preferred" node.
> >
> 
> Thanks Lars. The first node is preferred. Services are on the first
> node to start with, after the split it migrated the services to the
> second node (which should not happen without a meatclient
> confirmation) and then back again. We used meatclient to avoid the
> situation, so what did we do wrong?

Then my explanation was not quite applicable to this particular
situation; your old heartbeat setup handles it a bit differently.
Probably both nodes scheduled a shutdown and restart
once they recognized the rejoin after the split.
But since one node was already in the process of taking over resources,
just waiting for confirmation that the other node was dead,
it "deferred" that shutdown until "current resource activity finished".

If you read the logs from both nodes slowly, comparing time stamps,
they will explain what exactly happened.
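
Comparing time stamps across two nodes is easier with both logs
interleaved into one timeline. The following is only a rough Python
sketch of that idea, not a heartbeat tool. It assumes each log line
carries a stamp of the form YYYY/MM/DD_HH:MM:SS (syslog-style logs use
a different format and would need another pattern), and that the clocks
on both nodes are reasonably in sync. The names parse_lines, merge_logs
and format_entry are made up for this illustration.

```python
import re
from datetime import datetime

# Assumed line shape (adjust to your actual logs):
#   "heartbeat[1234]: 2009/09/30_14:22:32 WARN: node xy: is dead"
STAMP = re.compile(r"(\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})")


def parse_lines(node, lines):
    """Yield (timestamp, node, text) for every line that carries a stamp."""
    for line in lines:
        match = STAMP.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y/%m/%d_%H:%M:%S")
            yield stamp, node, line.rstrip("\n")


def merge_logs(logs):
    """Merge {node label: iterable of lines} into one time-ordered list."""
    entries = []
    for node, lines in logs.items():
        entries.extend(parse_lines(node, lines))
    # Sort on the timestamp only; the sort is stable, so lines with
    # equal stamps keep the order in which they were read.
    entries.sort(key=lambda entry: entry[0])
    return entries


def format_entry(entry):
    """Render one merged entry with the node label in front of the line."""
    stamp, node, text = entry
    return f"{stamp:%H:%M:%S} [{node}] {text}"
```

To use it, pass one open file per node, for example
merge_logs({"node1": open(path1), "node2": open(path2)}), and print
format_entry() for each result. Lines without a recognizable stamp are
skipped, so check the originals if something seems to be missing.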

I don't think you did anything wrong.  It's just that
heartbeat does not have many options to clean up the mess.

"meatware" is NOT there to prevent failovers,
but to confirm reset operations.
After the rejoin, there was nothing to confirm anymore,
as both nodes were able to talk to each other again.

Heartbeat (haresources) is not very flexible when handling
rejoins after partitions.

Pacemaker may or may not be able to handle such a situation
more to your liking, if configured appropriately.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
