Yet another discovery in our 10G production deployment. Under heavy event load, the event queue can fill up.
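For illustration, here is a minimal sketch (not Ryu code) of the underlying eventlet behaviour: a bounded queue parks the producing greenthread in put() once the queue is full and nothing drains it. The queue size and sleep duration below are arbitrary values chosen for the example.

# Minimal sketch, not Ryu code: a full bounded eventlet queue blocks its producer.
import eventlet
from eventlet import queue

q = queue.Queue(maxsize=2)           # arbitrary small bound for illustration

def producer():
    for i in range(5):
        q.put(i)                     # parks here once 2 items are queued and nothing consumes them
    print('producer finished')       # never reached without a consumer

eventlet.spawn(producer)
eventlet.sleep(0.1)                  # let the producer run until it blocks
print('queue size: %d' % q.qsize())  # prints 2; the producer is stuck in put()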
We used eventlet.backdoor to debug the state of the greenlets, and found many greenlets stuck like this:

(3311, <eventlet.greenthread.GreenThread object at 0x64eb230>)
  File "/opt/plexus/lib/python2.6/site-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/plexus/lib/python2.6/site-packages/ryu/lib/hub.py", line 52, in _launch
    func(*args, **kwargs)
  File "/opt/plexus/lib/python2.6/site-packages/ryu/controller/controller.py", line 351, in datapath_connection_factory
    if datapath.id is None:
  File "/opt/plexus/lib/python2.6/site-packages/ryu/controller/controller.py", line 271, in serve
    # Utility methods for convenience
  File "/opt/plexus/lib/python2.6/site-packages/ryu/controller/controller.py", line 104, in deactivate
    method(self)
  File "/opt/plexus/lib/python2.6/site-packages/ryu/controller/controller.py", line 201, in _recv_loop
    self.ofp_brick.get_handlers(ev) if
  File "/opt/plexus/lib/python2.6/site-packages/ryu/base/app_manager.py", line 302, in send_event_to_observers
    self.send_event(observer, ev, state)
  File "/opt/plexus/lib/python2.6/site-packages/ryu/base/app_manager.py", line 291, in send_event
    SERVICE_BRICKS[name]._send_event(ev, state)
  File "/opt/plexus/lib/python2.6/site-packages/ryu/base/app_manager.py", line 279, in _send_event
    self.events.put((ev, state))
  File "/opt/plexus/lib/python2.6/site-packages/eventlet/queue.py", line 262, in put
    result = waiter.wait()
  File "/opt/plexus/lib/python2.6/site-packages/eventlet/queue.py", line 140, in wait
    return get_hub().switch()
  File "/opt/plexus/lib/python2.6/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()

Since the put() to the event queue blocks indefinitely in the receive loop, no further OpenFlow events get processed once this occurs. Whatever greenlet is blocked in the put() will not yield, so no greenlet will ever run to perform a get() to unblock the queue.

Our hardware OpenFlow switches have a keepalive timer that sends regular echo requests. If the echo requests are not answered within a given time interval, the switch disconnects and reconnects. With the event queue full, however, the switch reconnection is never properly processed, and that leads to a downward spiral of switch disconnection/reconnection and unclosed sockets on the controller.

The following patch sets a timeout on the event queue put(), and logs the lost event if the put() times out. This way, the receive loop is at least no longer blocked from closing the socket, and other greenlets may get a chance to consume the event queue.

This patch also includes a minor typo fix.

Signed-off-by: Victor J. Orlikowski <[email protected]>

diff --git a/ryu/base/app_manager.py b/ryu/base/app_manager.py
index 3d5d895..5e4b8f0 100644
--- a/ryu/base/app_manager.py
+++ b/ryu/base/app_manager.py
@@ -287,7 +287,11 @@ class RyuApp(object):
                 handler(ev)
 
     def _send_event(self, ev, state):
-        self.events.put((ev, state))
+        try:
+            self.events.put((ev, state), timeout=5)
+        except hub.Full:
+            LOG.debug("EVENT LOST FOR %s %s",
+                      self.name, ev.__class__.__name__)
 
     def send_event(self, name, ev, state=None):
         """
@@ -520,7 +524,7 @@ class AppManager(object):
             self._close(app)
             events = app.events
             if not events.empty():
-                app.logger.debug('%s events remians %d', app.name, events.qsize())
+                app.logger.debug('%s events remains %d', app.name, events.qsize())
 
     def close(self):
         def close_all(close_dict):
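For reference, here is a minimal sketch (not Ryu code, and not using ryu.lib.hub) of the queue behaviour the patch relies on: eventlet's Queue.put() raises Full once its timeout expires on a queue that is still full. The maxsize and timeout values below are arbitrary.

# Minimal sketch, not Ryu code: put() with a timeout gives up instead of blocking forever.
from eventlet import queue

q = queue.Queue(maxsize=1)           # arbitrary bound for illustration
q.put('first')                       # fills the queue
try:
    q.put('second', timeout=1)       # waits up to 1 second, then gives up
except queue.Full:
    print('put() timed out; the event would be dropped and logged')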
Best,
Victor

--
Victor J. Orlikowski <> vjo@[cs.]duke.edu
