On Wed, Jan 9, 2013 at 3:13 AM, Greg Woods <[email protected]> wrote:
> On Tue, 2013-01-08 at 09:18 +1100, Andrew Beekhof wrote:
>
>> > On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote:
>> >
>> >> The problem is that either node can come up and run all the resources,
>> >> but as soon as I bring the other node online, it briefly looks normal,
>> >> but as soon as the stonith resource starts, the currently running node
>> >> gets fenced and the new node takes over all the resources. Then the
>> >> fenced node comes up, fences the other node and takes over, etc. Death
>> >> match.
>
>> That's odd. Normally it's a firewall issue. Did you happen to choose a
>> different port, perhaps?
>
> Close, but not quite. I did finally figure out what was going on, as the
> death match started again as I was reconfiguring the cluster from
> scratch, but this time I knew more about what was causing it. It started
> as soon as I added "xend" as a resource. A little trial and error showed
> that the heartbeat does not work if it is on an interface that also has
> a Xen bridge attached to it. This is unexpected because all the other
> kinds of networking on that interface work fine with the bridge active
> (e.g. ssh connections, IPMI connections, etc.), only heartbeat is
> affected. But it was absolutely reproducible. If I started xend by hand
> instead of having it as a cluster resource, again I got a death match. A
> careful reading of the logs did show that heartbeat was declaring the
> other node dead. So for some reason, heartbeat communication was lost as
> soon as the bridge was activated.

IIRC, part of the activation involves tearing down the "normal"
interface and creating the bridge.
At that point, the device heartbeat was talking to is gone.
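For anyone hitting this later: one quick way to confirm that a NIC has been absorbed into a bridge is to look for its `brport` entry in sysfs, which appears once an interface is enslaved to a Linux bridge (as xend's network-bridge script does). A minimal sketch, assuming sysfs is mounted and using a hypothetical interface name:

```shell
#!/bin/sh
# An interface enslaved to a Linux bridge gains a brport entry in sysfs.
# The Xen network-bridge script enslaves the physical NIC this way.
iface_bridged() {
    [ -e "/sys/class/net/$1/brport" ]
}

# "eth0" is just an example; substitute the NIC heartbeat is bound to.
if iface_bridged eth0; then
    echo "eth0 is enslaved to a bridge; do not run heartbeat on it"
else
    echo "eth0 is not part of a bridge"
fi
```

If the interface heartbeat uses shows up as a bridge port, that explains the lost heartbeat traffic and the resulting fencing loop.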

> I got the cluster running with xend by
> moving the heartbeat to a different interface.

Having heartbeat start after the bridge is created _should_ also work.
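The other workaround, moving heartbeat to a NIC the bridge script never touches, is a one-line change in ha.cf. A sketch, with hypothetical interface names (eth0 bridged by Xen, eth1 left alone):

```
# /etc/ha.d/ha.cf (fragment)
#bcast eth0    # eth0 is absorbed into the Xen bridge when xend starts
bcast eth1     # use an interface the bridge script does not touch
```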

> This is less than ideal
> because that interface is attached to a network that is also used for
> different things and has other hosts attached to it, but since this is
> only a test cluster, that's acceptable.
>
> --Greg
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
