On Dec 5, 2013, at 5:21 PM, Isaku Yamahata <isaku.yamah...@gmail.com> wrote:

> On Wed, Dec 04, 2013 at 12:37:19PM +0900,
> Maru Newby <ma...@redhat.com> wrote:
> 
>> In the current architecture, the Neutron service handles RPC and WSGI with a 
>> single process and is prone to being overloaded such that agent heartbeats 
>> can be delayed beyond the limit for the agent being declared 'down'.  Even 
>> if we increased the agent timeout as Yongsheg suggests, there is no 
>> guarantee that we can accurately detect whether an agent is 'live' with the 
>> current architecture.  Given that AMQP can ensure eventual delivery - it is 
>> a queue - is sending a notification blind such a bad idea?  In the best case 
>> the agent isn't really down and can process the notification.  In the worst 
>> case, the agent really is down but will be brought up eventually by a 
>> deployment's monitoring solution and process the notification when it 
>> returns.  What am I missing? 
>> 
> 
> Do you mean overload of the neutron server, not the neutron agent?
> So even though the agent sends periodic 'live' reports, the reports pile up
> unprocessed by the server.
> When the server sends a notification, it wrongly considers the agent dead,
> not because the agent failed to send live reports due to its own overload.
> Is this understanding correct?

Your interpretation is likely correct.  The demands on the service are going to 
be much higher by virtue of having to field RPC requests from all the agents to 
interact with the database on their behalf.


>> Please consider that while a good solution will track notification delivery 
>> and success, we may need 2 solutions:
>> 
>> 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to 
>> grizzly and havana.
> 
> How about twisting DhcpAgent._periodic_resync_helper?
> If no notification has been received from the server since the last sleep,
> it calls self.sync_state() even if self.needs_resync = False. Thus the
> inconsistency between agent and server caused by a lost notification
> will be fixed.
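
To make sure we're talking about the same change, here is a rough sketch of how 
I read the proposal (the notified_since_last_cycle flag is hypothetical and 
would have to be set by the agent's notification handlers):

    # Sketch only - names approximate the current DhcpAgent loop.
    import eventlet

    def _periodic_resync_helper(self):
        while True:
            eventlet.sleep(self.conf.resync_interval)
            notified = getattr(self, 'notified_since_last_cycle', False)
            self.notified_since_last_cycle = False
            if self.needs_resync or not notified:
                # Resync even when needs_resync is False if nothing was heard
                # from the server during the interval, on the theory that
                # silence may mean a notification was lost.
                self.needs_resync = False
                self.sync_state()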

Unless I'm missing something, wouldn't forcing more and potentially unnecessary 
resyncs increase the load on the Neutron service and negatively impact 
reliability?


>> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
>> 
>> I'm hoping that coming up with a solution to #1 will allow us the breathing 
>> room to work on #2 in this cycle.
> 
> Loss of notifications is somewhat inevitable, I think
> (unless tasks are logged to stable storage shared between server and agent).
> And unconditionally sending notifications would cause problems.

Regarding sending notifications unconditionally, what specifically are you 
worried about?  I can imagine 2 scenarios:

Case 1: Send notification to an agent that is incorrectly reported as down. 
Result:  Agent receives notification and acts on it.

Case 2: Send notification to an agent that is actually down.
Result: Agent comes up eventually (in a production environment this should be a 
given) and calls sync_state().  We definitely need to make sync_state more 
reliable, though (I discuss the specifics later in this message).

Notifications could of course be dropped if AMQP queues are not persistent and 
are lost, but I don't think there needs to be a code-based remedy for this.  An 
operator is likely to deploy the AMQP service in HA to prevent the queues from 
being lost, and know to restart everything in the event of catastrophic failure.

That's not to say we don't have work to do, though.  An agent is responsible 
for communicating resource state changes to the service, but the service 
neither detects nor reacts when the state of a resource is scheduled to change 
and fails to do so in a reasonable timeframe.  Thus, as in the bug that 
prompted this discussion, it is up to the user to detect the failure (a VM 
without connectivity).  Ideally, Neutron should be tracking resource state 
changes with sufficient detail and reviewing them periodically to allow timely 
failure detection and remediation.  However, such a change is unlikely to be a 
candidate for backport so it will have to wait.
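
To make that concrete, the kind of bookkeeping I have in mind would look 
roughly like this (purely hypothetical - nothing like this exists in Neutron 
today):

    # Hypothetical sketch: record expected resource state transitions with a
    # deadline and periodically look for transitions that never completed.
    import time

    class PendingTransitionTracker(object):
        def __init__(self, timeout=60):
            self.timeout = timeout
            self._pending = {}  # resource_id -> (target_state, deadline)

        def expect(self, resource_id, target_state):
            # Called when the service schedules a state change.
            self._pending[resource_id] = (target_state,
                                          time.time() + self.timeout)

        def observe(self, resource_id, state):
            # Called when an agent reports the new state; clears the expectation.
            pending = self._pending.get(resource_id)
            if pending and pending[0] == state:
                del self._pending[resource_id]

        def overdue(self):
            # Resources whose scheduled change never happened in time and
            # should be re-notified or flagged for remediation.
            now = time.time()
            return [rid for rid, (_, deadline) in self._pending.items()
                    if deadline < now]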


> 
> You mentioned agent crash. Server crash should also be taken care of
> for reliability. Admins also sometimes want to restart the neutron
> server/agents for various reasons.
> An agent can crash after receiving a notification but before it starts
> processing the actual tasks. The server can crash after committing changes
> to the DB but before sending notifications. In such cases, the notification
> will be lost.
> Polling to resync would be necessary somewhere.

Agreed, we need to consider the cases of both agent and service failure.  

In the case of service failure, thanks to recently merged patches, the dhcp 
agent will at least force a resync in the event of an error in communicating 
with the server.  However, there is no guarantee that the agent will 
communicate with the server during the downtime.  While polling is one possible 
solution, might it be preferable for the service to simply notify the agents 
when it starts?  The dhcp agent can already receive an agent_updated RPC 
message that triggers a resync.  
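
A rough sketch of what I mean, modeled on the existing RpcProxy-based 
notifiers (the 'server_restarted' payload key is just illustrative):

    from neutron.common import topics
    from neutron.openstack.common.rpc import proxy

    class StartupNotifier(proxy.RpcProxy):
        """Fanout an agent_updated message to all dhcp agents at server start."""

        BASE_RPC_API_VERSION = '1.0'

        def __init__(self):
            super(StartupNotifier, self).__init__(
                topic=topics.DHCP_AGENT,
                default_version=self.BASE_RPC_API_VERSION)

        def notify_startup(self, context):
            # Each dhcp agent's existing agent_updated handler would then
            # trigger a resync, picking up anything missed while the server
            # was down.
            self.fanout_cast(context,
                             self.make_msg('agent_updated',
                                           payload={'server_restarted': True}),
                             topic=topics.DHCP_AGENT)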


> - notification loss isn't considered:
>   self.resync is not always run.
>   Some optimizations are possible, for example:
>   - detect loss by sequence numbers
>   - polling can be postponed when notifications arrive without loss

Notification loss due to agent failure is already solved - sync_state() is 
called on startup.  Notification loss due to server failure could be handled as 
described above.   I think the larger problem is that calling sync_state() does 
not affect processing of notifications already in the queue, which could result 
in stale notifications being processed out-of-order, e.g.

- service sends 'network down' notification
- service goes down after committing 'network up' to db, but before sending 
notification
- service comes back up
- agent knows (somehow) to resync, setting the network 'up'
- agent processes stale 'network down' notification

Though tracking sequence numbers is one possible fix, what do you think of 
instead ignoring all notifications generated before a timestamp set at the 
beginning of sync_state()?  
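
A minimal sketch of that idea (names hypothetical, and it assumes the server 
stamps each notification with the time it was generated):

    import time

    class StaleNotificationFilter(object):
        def __init__(self):
            self.sync_started_at = 0.0

        def mark_sync_start(self):
            # Called at the top of sync_state().
            self.sync_started_at = time.time()

        def is_stale(self, notification):
            # Anything generated before the sync started is already reflected
            # (or superseded) by the state fetched during sync_state().
            return notification.get('generated_at', 0.0) < self.sync_started_at

One caveat: comparing a server-generated timestamp against the agent's clock 
assumes the clocks are reasonably synchronized; comparing against a timestamp 
returned by the sync call itself would avoid that.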


> - periodic resync spawns threads, but doesn't wait for their completion.
>   So if a resync takes a long time, the next resync can start while the
>   previous one is still running.

sync_state() now waits for the completion of threads thanks to the following 
patch:

https://review.openstack.org/#/c/59863/
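
For anyone who hasn't looked at it, here is a minimal illustration of the kind 
of wait the agent needs (eventlet's GreenPool.waitall) - not the actual patch:

    import eventlet

    def refresh_network(network_id):
        # Stand-in for the per-network work done during a resync.
        print('refreshed %s' % network_id)

    pool = eventlet.GreenPool()
    for net_id in ('net-a', 'net-b', 'net-c'):
        pool.spawn(refresh_network, net_id)
    pool.waitall()  # returns only once every refresh has completed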


m.

