(As an aside, we're using VLAN mode at Nimbis.)
Take care,
Lorin
--
Lorin Hochstein
Lead Architect - Cloud Services
Nimbis Services, Inc.
TL;DR
To fix issues with failed DHCP leases in VLAN mode, upgrade to dnsmasq
2.61[1].
THE LONG VERSION
There is an issue with the way nova uses dnsmasq in VLAN mode. It starts
up a single copy of dnsmasq for each VLAN on the network host (or on
every host in multi_host mode). The problem is in the way that dnsmasq
binds to an IP address and port[2]. Every copy can respond to broadcast
packets, but unicast packets can only be answered by one of the copies.

In nova this means that guests from only one project will get responses
to their unicast DHCP renew requests. Unicast requests from guests in
other projects get ignored. What happens next depends on the guest OS.
Linux generally sends a broadcast packet after the unicast fails, so the
only effect is a small (tens of ms) hiccup while the interface is
reconfigured. It can be much worse than that, however. I have seen cases
where Windows just gives up and ends up with a non-configured interface.

This bug was first noticed by some users of OpenStack who rolled their
own fix. Basically, on Linux, if you set the SO_BINDTODEVICE socket
option, it will allow different daemons to share the port and respond to
unicast packets, as long as they listen on different interfaces. I
managed to communicate with Simon Kelley, the maintainer of dnsmasq, and
he has integrated a fix[3] for the issue in the current version[1] of
dnsmasq.
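
For anyone curious what the SO_BINDTODEVICE trick looks like, here is a
rough Python sketch. It is an illustration only, not the dnsmasq patch.
The helper names and the interface name in the usage note are made up,
setting SO_REUSEADDR as well is my assumption about what a daemon sharing
a port would do, and on most Linux systems the option needs root or
CAP_NET_RAW.

```python
import socket

# Python exposes SO_BINDTODEVICE on Linux; 25 is the Linux value, used
# here as a fallback in case the constant is missing.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def device_option(ifname):
    """Encode an interface name as the NUL-terminated bytes setsockopt expects."""
    return ifname.encode("ascii") + b"\0"


def open_dhcp_socket(ifname, port=67, sock=None):
    """Return a UDP socket tied to one interface (hypothetical helper).

    Passing in `sock` lets the option sequence be exercised without
    privileges; by default a real UDP socket is created.
    """
    if sock is None:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Assumption: allow several daemons to bind the same address and port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Only traffic arriving on `ifname` is delivered to this socket.
    sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, device_option(ifname))
    sock.bind(("", port))
    return sock
```

Calling open_dhcp_socket("vlan100") and open_dhcp_socket("vlan101") from
two separate daemons would give each its own view of port 67, which is
the behaviour described above.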

I don't know how many users out there are using VLAN mode, but you
should be able to deal with this issue by upgrading dnsmasq. It would be
great if the various distributions could upgrade as well, or at least
try to patch in the fix[3]. If upgrading dnsmasq is out of the question,
a possible workaround is to minimize lease renewals with something like
the following combination of config options:
# release leases immediately on terminate
force_dhcp_release=true
# one week lease time
dhcp_lease_time=604800
# two week disassociate timeout
fixed_ip_disassociate_timeout=1209600
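
As a sanity check on those numbers, here is a small Python sketch that
derives them and estimates when a guest would first try to renew. It
assumes the client uses the RFC 2131 default timers (unicast renew at
50% of the lease, broadcast rebind at 87.5%); a server can override
these, so treat the renew times as approximate. The helper names are
made up.

```python
from datetime import timedelta


def to_seconds(delta):
    """Convert a timedelta to whole seconds, the unit the options use."""
    return int(delta.total_seconds())


dhcp_lease_time = to_seconds(timedelta(weeks=1))
fixed_ip_disassociate_timeout = to_seconds(timedelta(weeks=2))


def default_renew_times(lease_seconds):
    """Return (T1, T2) using the RFC 2131 defaults.

    T1 is when the client starts unicast renewals, T2 is when it falls
    back to broadcast rebinding.
    """
    return lease_seconds // 2, lease_seconds * 7 // 8


t1, t2 = default_renew_times(dhcp_lease_time)
print(dhcp_lease_time, fixed_ip_disassociate_timeout, t1, t2)
```

With a one-week lease, the first unicast renew attempt would come after
about three and a half days under those assumptions, so the failure
window is hit far less often than with a short lease.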
Vish