Hi guys,

I had been investigating a possible bug with the way interfaces are managed on 
virtual routers. The public interfaces are being brought up on backup routers 
and (because they boot second) they arp the IPs away from the master. I'd been 
examining an idea for a fix but whilst doing so I found that the system appears 
to be designed to bring up these interfaces.

I suspect that a few things need to be reworked - but the changes necessary go 
so far against what has been implemented that I wanted to open this up before 
doing the work.

Hopefully if I go through my findings you guys can help me see what I might be 
getting wrong.

The following was correct for pre-4.6 redundant routers:
 
  1. Both routers get configured with IP addresses, routes and iptables rules.
  2. Public interfaces are initially set as DOWN.
  3. Keepalived runs a VRRP instance on the private interface (eth0) to 
negotiate MASTER/BACKUP roles.
  4. Keepalived manages the virtual IP on eth0 used as the public gateway for 
the guest VMs.
  5. Keepalived uses a master notify script to bring up the public interfaces.

The above was true for pre-4.6 routers. Now, however, things appear to work 
differently:

  1. Both routers get configured as before.
  2. All interfaces apart from eth1 (the Hypervisor-link interface) are set as 
UP.
  3. Keepalived runs a VRRP instance on the first public interface (eth2) to 
negotiate MASTER/BACKUP roles.
  4. Keepalived manages the virtual IP as before.
  5. Keepalived uses a master notify script to bring up the public interfaces 
(unnecessary)
  6. Keepalived uses a backup notify script to bring down the public interfaces 
(unused)

This is unexpected for the following reasons:

  1. The keepalived notify script brings the public interfaces down when 
transitioning to BACKUP - so how can we expect to run a VRRP instance over eth2?
  2. If interfaces are down when transitioning to BACKUP, why are they not 
expected to be down to begin with? (Before the router has become MASTER)
  3. Why are we running a VRRP instance over an interface with an IP that will 
clash with another host on the network?

The following method from the CsIP class in /opt/cloud/bin/cs/CsAddress.py 
confuses matters futher:

    def check_is_up(self):
         """ Ensure device is up """
         cmd = "ip link show %s | grep 'state DOWN'" % self.getDevice()
         for i in CsHelper.execute(cmd):
             if " DOWN " in i:
                 cmd2 = "ip link set %s up" % self.getDevice()
                 # If redundant only bring up public interfaces that are not 
eth1.
                 # Reason: private gateways are public interfaces.
                 # master.py and keepalived will deal with eth1 public 
interface. 
                 if self.cl.is_redundant() and (not self.is_public() or 
self.getDevice() not in PUBLIC_INTERFACE):
                     CsHelper.execute(cmd2)
                 # if not redundant bring everything up
                 if not self.cl.is_redundant():
                     CsHelper.execute(cmd2)
 
The comments refer to eth1 as a public interface when this is the link to the 
hypervisor. Indeed, PUBLIC_INTERFACE is defined on line 31 as ['eth1']. But 
keepalived and master.py don't influence eth1 at all. This looks like a mistake.

Lastly, the logic of this line looks flawed:

  if self.cl.is_redundant() and (not self.is_public() or self.getDevice() not 
in PUBLIC_INTERFACE)

As PUBLIC_INTERFACE is limited to eth1, the `not self.is_public()` will be 
ignored. Public IPs will never be assigned to eth1, so this line evaluates as:

  
  if self.cl.is_redundant() and (self.getDevice() not in PUBLIC_INTERFACE)

which reduces even further to:

  if self.cs.is_redundant() and self.is_control()


What would need doing
---------------------

  1. The keepalived.conf template would need to be changed to run the VRRP 
instance over eth0.
  2. The check_is_up method of the CsIP class should be renamed to 
'bring_up_interfaces'. For redundant routers it should ignore IPs that pass 
is_public or needs_vrrp.
  3. The arpPing method should do nothing if the interface is down.
  4. The PUBLIC_INTERFACE constant should be either renamed or dropped 
altogether.
  5. Other things that I haven't considered?


I'd really appreciate any feedback on this. It's possible that I've got it all 
wrong but I'm suspecting not. I just don't want to tread on anyone's toes by 
submitting a PR that goes against what appears to be an explicit design 
decision.


Kind regards,

Dean Close
iCloudHosting.com
http://www.icloudhosting.com
Tel: 01582 227927

Unit 2, Smallmead Road, Reading RG2 0QS

******************************************************************
The names iCloudHosting and iCloudHosting.com are trading styles of BBS 
Commerce Ltd which is registered in England and Wales, Company Number 04837714. 
Please use our trading address above for mail. Our registered office is 5 
Theale Lakes Business Park, Moulden Way, Sulhamstead, Reading, Berkshire, RG7 
4GB. VAT Registration Number GB 982 8230 94.

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you are not the intended recipient you are not authorised to and must not 
disclose, copy, distribute, or retain this message or any part of it.

iCloudHosting accepts no responsibility for information, errors or omissions in 
this email.
******************************************************************

Reply via email to