On Wed, 2013-02-06 at 11:24 +0100, Michael Schwartzkopff wrote:
> Am Mittwoch, 6. Februar 2013, 11:06:23 schrieb Roman Haefeli:
> > Hi all
> > 
> > We are running a pacemaker/corosync cluster with three nodes that
> > manages ~30 OpenVZ containers.
> > 
> > We recently had a situation where one node fenced the other two
> > nodes (sbd is configured as a stonith device). In the system logs I was
> > able to spot the line where the node gives the death pill to the others.
> > However, I have difficulties finding the original reason for the
> > decision to fence the other nodes.
> > 
> > Before I spam the list with logs, I'd like to ask if there is something
> > particular I should look for. Is there any advice on how to proceed
> > in such a situation?
> > 
> > Many thanks in advance.
> > 
> > Roman
> 
> The reason should be in the logs above the fencing event. Something like
> 
> corosync: lost connection.
> 
> If you want help from the list, paste your logs (the relevant parts only!) to 
> pastebin and mail the link.

I wasn't sure which parts were relevant. In the meantime, however, we
were able to explain the situation. As so often, it was a whole chain of
circumstances that eventually led to the fencing.

Here's the whole story (for those interested):

Each node has two NICs that form a network bond. Two VLANs are
configured on this bond: one for the DMZ and one for internal use
(corosync ring and NFS traffic). Some containers have their virtual
ethernet (veth) devices bridged to the internal VLAN for NFS access.
Whenever a container starts or stops, its veth device joins or leaves
the bridge on the internal VLAN. This wouldn't generally be a problem,
but it is with the Debian OpenVZ kernel: with this kernel, a bridge
always uses the numerically smallest MAC address among its member
interfaces as its own. When a container's MAC address is smaller than
that of the physical NIC, the MAC address of the bridge changes every
time that container is started or stopped. This MAC flapping caused
network lag on the bridge to which the corosync ring is also connected.
That eventually broke the corosync ring, which in turn led to the
fencing of the two nodes.
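
The smallest-MAC rule can be illustrated with a tiny sketch (the MAC
addresses below are made-up examples, not the real ones from our setup):

```shell
#!/bin/sh
# On the affected kernel a bridge adopts the numerically lowest MAC
# among its member ports. Since MACs are fixed-width lowercase hex
# strings, a plain lexical sort reflects the numeric order.
NIC_MAC="aa:bb:cc:00:00:01"   # physical NIC (bond/VLAN) side of the bridge
VETH_MAC="00:18:51:00:00:01"  # container veth -- numerically lower

# While the container runs, the bridge carries the lowest MAC of both;
# when it stops, the bridge flips back to the NIC's MAC.
printf '%s\n%s\n' "$NIC_MAC" "$VETH_MAC" | sort | head -n 1
# -> 00:18:51:00:00:01
```

So with every container start/stop, every host on the internal VLAN has
to relearn where the bridge's MAC lives.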

Any one of the following would have prevented this:
* a non-Debian OpenVZ kernel (different scheme for assigning a MAC to a
bridge)
* giving a higher MAC address to the container's veth
* running the corosync ring on its own VLAN
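
For the second workaround, a minimal sketch (the interface names
veth101.0 and br-int are assumptions, not our actual names): give the
host side of the container's veth a locally administered MAC that sorts
above the physical NIC's, so the bridge keeps the NIC's MAC. On many
kernels you can instead pin the bridge's MAC explicitly.

```shell
# Assumed names: veth101.0 = host side of the container's veth,
# br-int = bridge on the internal VLAN. Run as root.

# Give the veth a high, locally administered MAC (fe:... sorts above
# any universally administered vendor MAC), so the bridge never
# inherits it:
ip link set dev veth101.0 address fe:ff:ff:00:01:01

# Alternatively, set the bridge's MAC explicitly; on many kernels a
# manually assigned bridge address is kept even as ports come and go:
ip link set dev br-int address aa:bb:cc:00:00:01
```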

Roman
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
