Re: [openstack-dev] [Neutron] DHCP Agent Reliability

Carl Baldwin Wed, 04 Dec 2013 13:38:31 -0800

Sorry to have taken the discussion on a slight tangent.  I meant only
to offer the solution as a stop-gap.  I agree that the fundamental
problem should still be addressed.


On Tue, Dec 3, 2013 at 8:01 PM, Maru Newby <[email protected]> wrote:
>
> On Dec 4, 2013, at 1:47 AM, Stephen Gran <[email protected]> wrote:
>
>> On 03/12/13 16:08, Maru Newby wrote:
>>> I've been investigating a bug that is preventing VMs from receiving IP
>>> addresses when a Neutron service is under high load:
>>>
>>> https://bugs.launchpad.net/neutron/+bug/1192381
>>>
>>> High load causes the DHCP agent's status updates to be delayed, causing the 
>>> Neutron service to assume that the agent is down.  This results in the 
>>> Neutron service not sending notifications of port addition to the DHCP 
>>> agent.  At present, the notifications are simply dropped.  A simple fix is 
>>> to send notifications regardless of agent status.  Does anybody have any 
>>> objections to this stop-gap approach?  I'm not clear on the implications of 
>>> sending notifications to agents that are down, but I'm hoping for a simple 
>>> fix that can be backported to both havana and grizzly (yes, this bug has 
>>> been with us that long).
>>>
>>> Fixing this problem for real, though, will likely be more involved.  The 
>>> proposal to replace the current wsgi framework with Pecan may increase the 
>>> Neutron service's scalability, but should we continue to use a 'fire and 
>>> forget' approach to notification?  Being able to track the success or 
>>> failure of a given action outside of the logs would seem pretty important, 
>>> and allow for more effective coordination with Nova than is currently 
>>> possible.
>>
>> It strikes me that we ask an awful lot of a single neutron-server instance - 
>> it has to take state updates from all the agents, it has to do scheduling, 
>> it has to respond to API requests, and it has to communicate about actual 
>> changes with the agents.
>>
>> Maybe breaking some of these out the way nova has a scheduler and a 
>> conductor and so on might be a good model (I know there are things people 
>> are unhappy about with nova-scheduler, but imagine how much worse it would 
>> be if it was built into the API).
>>
>> Doing all of those tasks, and doing it largely single threaded, is just 
>> asking for overload.
>
> I'm sorry if it wasn't clear in my original message, but my primary concern 
> lies with the reliability rather than the scalability of the Neutron service. 
>  Carl's addition of multiple workers is a good stop-gap to minimize the 
> impact of blocking IO calls in the current architecture, and we already have 
> consensus on the need to separate RPC and WSGI functions as part of the Pecan 
> rewrite.  I am worried, though, that we are not being sufficiently diligent 
> in how we manage state transitions through notifications.  Managing 
> transitions and their associated error states is needlessly complicated by the
> current ad-hoc approach, and I'd appreciate input on the part of distributed 
> systems experts as to how we could do better.
>
>
> m.
>
>
> _______________________________________________
> OpenStack-dev mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

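For readers following the thread, the failure mode described in the quoted
report (delayed status updates cause the server to treat a running agent as
down, so port notifications are dropped) can be sketched roughly as below.
This is only an illustration under stated assumptions: the `Agent` class,
`AGENT_DOWN_TIME`, `notify_port_created`, and the `"port_created"` message
label are hypothetical names invented for this sketch and are not Neutron's
actual code or API. The `skip_down_agents` flag switches between the
behaviour reported in the bug and the proposed stop-gap of notifying
regardless of agent status.

```python
from dataclasses import dataclass, field

AGENT_DOWN_TIME = 75  # seconds; hypothetical threshold chosen for this sketch


@dataclass
class Agent:
    """A DHCP agent as seen by the server (illustrative, not Neutron's model)."""
    host: str
    last_heartbeat: float                      # when the last status update arrived
    inbox: list = field(default_factory=list)  # stands in for the agent's message queue


def is_alive(agent, now):
    # The server can only infer liveness from how recent the last heartbeat is,
    # so a heartbeat delayed by server load looks the same as a dead agent.
    return now - agent.last_heartbeat <= AGENT_DOWN_TIME


def notify_port_created(agents, port_id, now, skip_down_agents):
    """Send a port-creation notification and return the hosts that were notified."""
    notified = []
    for agent in agents:
        if skip_down_agents and not is_alive(agent, now):
            # Behaviour described in the bug report: the notification is dropped.
            continue
        agent.inbox.append(("port_created", port_id))
        notified.append(agent.host)
    return notified


if __name__ == "__main__":
    # An agent that is still running but whose heartbeat was delayed by load.
    slow = Agent(host="dhcp-1", last_heartbeat=0.0)
    print(notify_port_created([slow], "port-a", now=100.0, skip_down_agents=True))   # []
    print(notify_port_created([slow], "port-a", now=100.0, skip_down_agents=False))  # ['dhcp-1']
```

Note that even with `skip_down_agents=False` the send remains fire and
forget: nothing in the sketch records whether the agent actually processed
the message, which is the broader concern raised in the thread.
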