Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

Chuck Carlino Wed, 28 Jan 2015 16:02:17 -0800

On 01/28/2015 12:51 PM, Kevin Benton wrote:

If we are going to ignore the IP address changing use-case, can we just make the default infinity? Then nobody ever has to worry about control plane outages for existing clients. 24 hours is way too long to be useful anyway.

Why would users want to change an active port's IP address anyway? I can see possible use in changing an inactive port's IP address, but that wouldn't cause the dhcp issues mentioned here. I worry about setting a default config value to handle a very unusual use case.


Chuck

On Jan 28, 2015 12:44 PM, "Salvatore Orlando" <[email protected]<mailto:[email protected]>> wrote:




    On 28 January 2015 at 20:19, Brian Haley <[email protected]
    <mailto:[email protected]>> wrote:

        Hi Kevin,

        On 01/28/2015 03:50 AM, Kevin Benton wrote:
        > Hi,
        >
        > Approximately a year and a half ago, the default DHCP lease time in
        > Neutron was increased from 120 seconds to 86400 seconds.[1] This was
        > done with the goal of reducing DHCP traffic with very little
        > discussion (based on what I can see in the review and bug report).
        > While it does indeed reduce DHCP traffic, I don't think any bug
        > reports were filed showing that a 120 second lease time resulted in
        > too much traffic or that a jump all of the way to 86400 seconds was
        > required instead of a value in the same order of magnitude.
        >
        > Why does this matter?
        >
        > Neutron ports can be updated with a new IP address from the same
        > subnet or another subnet on the same network. The port update will
        > result in anti-spoofing iptables rule changes that immediately stop
        > the old IP address from working on the host. This means the host is
        > unreachable for 0-12 hours based on the current default lease time
        > without manual intervention[2] (assuming half-lease length DHCP
        > renewal attempts).
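
The 0-12 hour figure above follows from the half-lease renewal assumption. A minimal sketch of that arithmetic, assuming the client first tries to renew at 50% of the lease (the RFC 2131 default for T1) and obtains its new address on that attempt; the function name is hypothetical and not part of Neutron:

```python
def worst_case_unreachable_seconds(lease_seconds: int) -> float:
    """Upper bound on how long a VM keeps using a stale address.

    Assumption: the client's first renewal attempt happens at half the
    lease, and that attempt is when it learns about the new address.
    """
    if lease_seconds <= 0:
        raise ValueError("lease_seconds must be positive")
    return lease_seconds / 2


# Lease values discussed in this thread: old default, proposed, current default.
for lease in (120, 480, 86400):
    window = worst_case_unreachable_seconds(lease)
    print(f"lease {lease:>6}s -> unreachable for up to {window / 60:.0f} min")
```

Under these assumptions the proposed 480-second lease bounds the outage at 4 minutes, which matches the "<5 minutes" claim made later in the message.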

        So I'll first comment on the problem.  You're essentially "pulling
        the rug" out from under these VMs by changing their IP (and that of
        their router and DHCP/DNS server), but you expect they should fail
        quickly and come right back online.  In a non-Neutron environment
        wouldn't the IT person that did this need some pretty good
        heat-resistant pants for all the flames from pissed-off users?  Sure,
        the guy on his laptop will just bounce the connection, but servers
        (aka VMs) should stay pretty static.  VMs are servers (and cows
        according to some).


    I actually expect this kind of operation to not be one Neutron users
    will do very often, mostly because regardless of whether you're in
    the cloud or not, you'd still need to wear those heat-resistant pants.


        The correct solution is to be able to renumber the network so there
        is no issue with the anti-spoofing rules dropping packets, or the VMs
        having an unreachable IP address, but that's a much bigger nut to
        crack.


    Indeed. In my opinion the "update IP" operation sets false
    expectations in users. I have considered disallowing PUT on
    fixed_ips in the past but that did not go ahead because there were
    users leveraging it.


        > Why is this on the mailing list?
        >
        > In an attempt to make the VMs usable in a much shorter timeframe
        > following a Neutron port address change, I submitted a patch to
        > reduce the default DHCP lease time to 8 minutes.[3] However, this
        > was upsetting to several people,[4] so it was suggested I bring this
        > discussion to the mailing list. The following are the high-level
        > concerns followed by my responses:
        >
        >   * 8 minutes is arbitrary
        >       o Yes, but it's no more arbitrary than 1440 minutes. I picked
        >         it as an interval because it is still 4 times larger than
        >         the last short value, but it still allows VMs to regain
        >         connectivity in <5 minutes in the event their IP is changed.
        >         If someone has a good suggestion for another interval based
        >         on known dnsmasq QPS limits or some other quantitative
        >         reason, please chime in here.

        We run 48 hours as the default in our public cloud, and I did some
        digging to remind myself of the multiple reasons:

        1. Too much DHCP traffic.  Sure, only that initial request is
        broadcast, but dnsmasq is very verbose and loves writing to syslog
        for everything it does - less is more.  Do a scale test with 10K VMs
        and you'll quickly find out a large portion of traffic is DHCP
        RENEWs, and syslog is huge.


    This is correct, and something I overlooked in my previous post.
    Nevertheless I still think that it is really impossible to find an
    optimal default which is regarded as such by every user. The
    current default has been chosen mostly for the reason you explain
    below, and I don't see a strong reason for changing it.


        2. During a control-plane upgrade or outage, having a short DHCP
        lease time will take all your VMs offline.  The old value of 2
        minutes is not a realistic value for an upgrade, and I don't think 8
        minutes is much better.  Yes, when DHCP is down you can't boot a new
        VM, but as long as customers can get to their existing VMs they're
        pretty happy and won't scream bloody murder.


    In our cloud we were continuously hit by this. We could not take
    our dhcp agents out, otherwise all VMs would lose their leases,
    unless the downtime of the agent was very brief.


        There's probably more, but those were the top two, with #2 being most
        important.
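
Both reasons can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: it assumes every VM renews once per half lease, that renewals are spread evenly over time, and that a client which was due to renew exactly when the DHCP agent went down still has half its lease left. The function names are hypothetical.

```python
def renewals_per_second(num_vms: int, lease_seconds: int) -> float:
    """Steady-state renewal rate if every VM renews once per half lease."""
    return num_vms / (lease_seconds / 2)


def min_outage_tolerance_seconds(lease_seconds: int) -> float:
    """Lease time every client still has left when the DHCP agent goes down.

    A client that was due to renew at the moment of the outage has used
    half of its lease, so half the lease remains before it expires.
    """
    return lease_seconds / 2


NUM_VMS = 10_000  # the scale-test size mentioned above
LEASES = (("2 min", 120), ("8 min", 480), ("24 h", 86_400), ("48 h", 172_800))
for label, lease in LEASES:
    rate = renewals_per_second(NUM_VMS, lease)
    slack = min_outage_tolerance_seconds(lease)
    print(f"{label:>5}: {rate:8.2f} renewals/s, survives >= {slack / 60:.0f} min outage")
```

In this model, shortening the lease raises the renewal load and shrinks the outage window a deployment can ride out by the same factor, which is the trade-off being argued over in this thread.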


    Summarizing, I think that Kevin is exposing a real, albeit
    well-known problem (sorry about my dhcp release faux pas - I can
    use jet lag as a justification!), and he's proposing a mitigation
    to it. On the other hand, this mitigation, as Brian explains, is
    going to cause real operational issues. Still, we're arguing about
    a default value for a configuration parameter. I therefore
    think the best thing that we can do is explicitly state what
    happens when setting long or short lease times.
    I expected this to be documented in [1], but it's not. I think
    that place and neutron.conf might contain this kind of
    documentation, such as:

    # DHCP Lease duration (in seconds).
    # Use -1 to tell dnsmasq to use infinite lease times.
    # dhcp_lease_duration = 86400
    # Note that long DHCP leases will result in delays
    # in instances acquiring updated IP addresses. This
    # may result in downtime for those instances as the
    # anti-spoofing policy will then block all traffic in and out of
    # them. In order to minimise this downtime window
    # the lease time should be shorter, for example
    # dhcp_lease_duration = 480
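
To illustrate the -1 convention in the proposed comment, here is a small sketch of how such a config value could be turned into a dnsmasq lease-time string. The helper name is hypothetical and this is not taken from Neutron's code; it assumes dnsmasq accepts either the word "infinite" or a number of seconds with an "s" suffix as the lease time.

```python
def dnsmasq_lease_argument(dhcp_lease_duration: int) -> str:
    """Translate a dhcp_lease_duration value into a dnsmasq lease string.

    Hypothetical helper for illustration only.
    -1 means "never expire"; any other value must be a positive number
    of seconds.
    """
    if dhcp_lease_duration == -1:
        return "infinite"
    if dhcp_lease_duration <= 0:
        raise ValueError("lease duration must be positive, or -1 for infinite")
    return f"{dhcp_lease_duration}s"


# The three settings that come up in this thread.
for value in (86400, 480, -1):
    print(f"dhcp_lease_duration = {value:>5} -> lease time {dnsmasq_lease_argument(value)}")
```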

    However, I would not change the current system default, as this
    might affect operational systems.

    Apologies again for my stupid dhcp-release note,
    Salvatore

    [1] http://developer.openstack.org/api-ref-networking-v2.html


        >   * other datacenters use long lease times
        >       o This is true, but it's not really a valid comparison. In
        >         most regular datacenters, updating a static DHCP lease has
        >         no effect on the data plane so it doesn't matter that the
        >         client doesn't react for hours/days (even with DHCP
        >         snooping enabled). However, in Neutron's case, the security
        >         groups are immediately updated so all traffic using the old
        >         address is blocked.

        Yes, and choosing the lease time is a deployment decision that needs
        to take a lot of things into account.  Like I said, we don't even use
        the default.  The default should just be a good guess for a standard
        deployment, not a value that caters towards the edge cases,
        especially when the value is tunable in neutron.conf.

        >   * dhcp traffic is scary because it's broadcast
        >       o ARP traffic is also broadcast and many clients will expire
        >         entries every 5-10 minutes and re-ARP. L2population may be
        >         used to prevent ARP propagation, so the comparison between
        >         DHCP and ARP isn't always relevant here.

        I don't recall anyone being scared of broadcast, and can't find any
        comments regarding it in https://review.openstack.org/#/c/150595/

        > Please reply back with your opinions/anecdotes/data related to
        > short DHCP lease times.

        I can only speculate on why 24 hours was chosen as the default back
        in 2013, possibly because a lot of wireless router firmware defaults
        are set as such?

        > 1. https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
        > 2. Manual intervention could be an instance reboot, a dhcp client
        > invocation via the console, or a delayed invocation right before
        > the update. (all significantly more difficult to script than a
        > simple update of a port's IP via the API).
        > 3. https://review.openstack.org/#/c/150595/
        > 4. http://i.imgur.com/xtvatkP.jpg

        I was a much bigger baby than that :)

        -Brian

        
        __________________________________________________________________________
        OpenStack Development Mailing List (not for usage questions)
        Unsubscribe:
        [email protected]?subject:unsubscribe
        <http://[email protected]?subject:unsubscribe>
        http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



