On 02/13/2018 05:08 PM, Armando M. wrote:
On 13 February 2018 at 14:02, Brent Eagles <beag...@redhat.com> wrote:
The neutron agents are implemented in such a way that key
functionality is implemented in terms of haproxy, dnsmasq,
keepalived and radvd configuration. The agents manage instances of
these services but, by design, those processes are not children of
the agent: they end up parented to the top-most process (pid 1).
On baremetal this has the advantage that, while control plane
changes cannot be made while the agents are down, the configuration
that was in place when the agents were stopped keeps working (for
example, VMs that are restarted can still request their IPs, etc.).
In short, the dataplane is not affected by shutting down the agents.
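To make that concrete, here is a minimal sketch (not Neutron's actual
external process manager code; the dnsmasq arguments are placeholders) of
how a helper can be launched in its own session so that it survives the
agent exiting and gets reparented to pid 1:

    # Minimal sketch, not Neutron's real process-manager code: launch a
    # long-lived helper (e.g. dnsmasq) so that it does not die with the agent.
    import subprocess

    def spawn_detached(cmd):
        # start_new_session=True calls setsid() in the child, so the helper
        # gets its own session; when the agent exits, the helper is simply
        # reparented to pid 1 and keeps serving the dataplane.
        return subprocess.Popen(cmd, start_new_session=True,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)

    if __name__ == "__main__":
        proc = spawn_detached(["dnsmasq", "--no-daemon",
                               "--conf-file=/tmp/example-dnsmasq.conf"])
        print("helper pid:", proc.pid)
        # The agent can now exit or be restarted; the helper keeps running.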
In the TripleO containerized version of these agents, the supporting
processes (haproxy, dnsmasq, etc.) run within the agent's container,
so when the container is stopped, the supporting processes are
stopped as well. That is, the behavior of the current containers is
significantly different from baremetal, and stopping/restarting
containers effectively breaks the dataplane. At the moment this is
being considered a blocker and, unless we can find a resolution, we
may need to recommend running the L3, DHCP and metadata agents on
baremetal.
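A quick way to confirm this on a node (a rough sketch; it assumes psutil is
available and that the helpers show up in the process table where it runs)
is to snapshot the dataplane helpers before and after restarting an agent
container:

    # Rough check of the failure mode described above: list the dataplane
    # helper processes, restart the agent container, then run this again and
    # compare. On baremetal the pids stay the same; with the current
    # containers the helpers disappear along with the agent.
    import psutil

    HELPERS = {"dnsmasq", "haproxy", "keepalived", "radvd"}

    def helper_pids():
        found = {}
        for proc in psutil.process_iter(["pid", "name"]):
            if proc.info["name"] in HELPERS:
                found.setdefault(proc.info["name"], []).append(proc.info["pid"])
        return found

    if __name__ == "__main__":
        for name, pids in sorted(helper_pids().items()):
            print(name, pids)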
I didn't think the neutron metadata agent was affected, just the
ovn-metadata agent? Or is there a problem with the UNIX domain sockets
the haproxy instances use to connect to it when the container is restarted?
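If the sockets are the suspect, a simple probe like this would show whether
the metadata agent is still reachable after a restart (just a sketch;
/var/lib/neutron/metadata_proxy is the usual default for
metadata_proxy_socket and may differ per deployment):

    # Probe the metadata agent's UNIX socket that the haproxy instances
    # connect to. The path is the usual default for metadata_proxy_socket;
    # adjust it for the deployment being tested.
    import socket

    SOCKET_PATH = "/var/lib/neutron/metadata_proxy"

    def probe(path):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(2)
        try:
            s.connect(path)
            return True
        except OSError as exc:
            print("connect failed:", exc)
            return False
        finally:
            s.close()

    if __name__ == "__main__":
        print("metadata socket reachable:", probe(SOCKET_PATH))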
There's quite a bit to unpack here: are you suggesting that running
these services in HA configuration doesn't help either with the data
plane being gone after a stop/restart? Ultimately this boils down to
where the state is persisted, and while certain agents rely on
namespaces and processes whose ephemeral nature is hard to persist,
enough could be done to allow for a non-disruptive bumping of the
aforementioned services.
Armando - https://review.openstack.org/#/c/542858/ (if accepted) should
help with dataplane downtime, as sharing the namespaces lets them
persist, which reduces how much the agent has to reconfigure when a
container is restarted (think of what the l3-agent needs to create for
1000 routers).
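As background on why sharing the namespaces helps (a generic sketch of the
kernel mechanism, not of the patch itself): a named network namespace is
pinned by a bind mount under /run/netns, so it survives the exit of
whatever process created it, and if that mount is visible to the host it
also survives a container restart.

    # Generic illustration of namespace persistence (requires root and
    # iproute2; this is not the patch above). A named netns is kept alive by
    # a bind mount under /run/netns, not by the process that created it.
    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        sh("ip", "netns", "add", "qrouter-demo")
        sh("ip", "netns", "exec", "qrouter-demo", "ip", "link", "set", "lo", "up")
        # This script can now exit. Later, from any other process,
        #   ip netns exec qrouter-demo ip addr
        # still shows the namespace and its state, because it is pinned by
        # the /run/netns/qrouter-demo bind mount.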
But it doesn't address dnsmasq being unavailable when the dhcp-agent
container is restarted, as it is today. Maybe one way around that is
to run 2+ agents per network, but that still leaves a regression from
how it works today. Even with l3-ha I'm not sure things are perfect,
we might wind up with two masters sometimes.
I've seen one suggestion of putting all these processes in their own
containers instead of the agent container so they continue to run; it
just might be invasive to the neutron code. Maybe there is another option?
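Purely as a strawman (the image name, container name, paths and dnsmasq
arguments below are made up), the "own container" idea could look roughly
like this from the agent's side: instead of exec'ing dnsmasq directly, the
agent asks the container runtime to start it as a separate container whose
lifecycle is independent of the agent's.

    # Hypothetical sketch of the "run the helpers in their own containers"
    # idea; image name, container name and arguments are placeholders. The
    # point is only that the helper's lifecycle is decoupled from the agent's.
    import subprocess

    def start_dnsmasq_container(network_id, conf_path):
        name = "neutron-dnsmasq-%s" % network_id
        subprocess.run(
            ["docker", "run", "-d",
             "--name", name,
             "--network", "host",
             "--privileged",
             "-v", "%s:%s:ro" % (conf_path, conf_path),
             "example.org/neutron-dnsmasq:latest",   # placeholder image
             "dnsmasq", "--no-daemon", "--conf-file=%s" % conf_path],
            check=True,
        )
        return name

    if __name__ == "__main__":
        start_dnsmasq_container("net-1234", "/tmp/example-dnsmasq.conf")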