On Feb 17, 2016, at 11:16 PM, IWAMOTO Toshihiro <[email protected]> wrote:
> I think the code is mostly okay now.
>
Well - that's a start. ;)

>> +        self.events = hub.PriorityQueue(96)
>> +        self._event_get_timeout = 5
>> +        # The sum of the following semaphores must equal the number of
>> +        # entries in the events queue.
>> +        self._events_sem = hub.BoundedSemaphore(32)
>> +        self._sc_events_sem = hub.BoundedSemaphore(32)
>> +        self._pi_events_sem = hub.BoundedSemaphore(32)
>
> I still don't like this code much, but assuming event priorities are
> really important to your application, we might need to bear with this
> additional code and overhead.
>

So - let me tell you where I'm coming from, and we can discuss the need
for priorities.

With the issue that I described with respect to OFPFC_ADD (the potential
for the switch to flood the controller with PacketIn events), we were
finding that the controller could *easily* be swamped with PacketIn
events; so much so that "more important" events were not getting
processed (even with a longer queue).

It occurs to me that, while the controller can be overwhelmed by *any* of
several situations where the switch malfunctions and sends too many
events, this is most *likely* to occur with PacketIn events, if the
controller application is inserting flows in a "reactive" manner. This is
*especially* true when high-volume flows somehow manage to fall upon a
rule that forwards to the controller.

Hence my desire to have PacketIn events be of lower priority; they're the
most likely (in a reactive flow insertion scenario) to overwhelm the
controller.

This does not change the fact, however, that *any* asynchronous or
symmetric message *could* be abused by a malicious switch to overwhelm
the controller; it just makes the more likely "overwhelmed by PacketIn
events" scenario less likely to occur.

In thinking about this, though, I am willing to cut this down to 2
semaphores, and a single PriorityQueue of 128 items: 1 semaphore for
PacketIns, and 1 for all other types of events (which *should* - in
general - be far less numerous in a reactive application).
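To make the 2-semaphore scheme concrete, here's a rough sketch - using the stdlib `queue` and `threading` modules as stand-ins for Ryu's `hub` wrappers, and with illustrative names (`EventQueue`, `send_event`, `get_event`) rather than the actual patch:

```python
import itertools
import queue
import threading

# Lower number = served first by PriorityQueue.
PRIO_OTHER = 0      # errors, barriers, replies - "everything else"
PRIO_PACKET_IN = 1  # PacketIn: most likely to flood a reactive controller


class EventQueue(object):
    """One 128-slot priority queue, partitioned by two bounded semaphores."""

    def __init__(self):
        self.events = queue.PriorityQueue(128)
        # The two semaphores must sum to the queue size, so a flood of
        # PacketIns can never occupy the slots reserved for other events.
        self._pi_sem = threading.BoundedSemaphore(64)
        self._other_sem = threading.BoundedSemaphore(64)
        self._seq = itertools.count()  # FIFO tie-break within a priority

    def send_event(self, ev, packet_in=False):
        sem = self._pi_sem if packet_in else self._other_sem
        sem.acquire()  # the sender blocks once its partition is full
        prio = PRIO_PACKET_IN if packet_in else PRIO_OTHER
        self.events.put((prio, next(self._seq), ev))

    def get_event(self):
        prio, _, ev = self.events.get()
        sem = self._pi_sem if prio == PRIO_PACKET_IN else self._other_sem
        sem.release()  # free the slot in the matching partition
        return ev
```

The point of the partition: even if PacketIns arrive faster than they can be drained, the senders of "everything else" events still have 64 slots reserved, and the PriorityQueue serves those events first.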
> But I wonder, if a _recv_loop blocks, aren't the following messages
> from that switch blocked regardless of event priority?
> Isn't it better to accept as many events as possible without blocking?
>

If the _recv_loop in the Datapath blocks, it will only be for three
reasons, as far as I can tell:

1) The socket recv() receives no data, and the TCP stack does not notify
   it of this happening. That's a possibility, and the reason I added a
   timeout on the recv(). Socket operations in eventlet are all
   non-blocking under the covers, and, without a timeout, shutdown() with
   SHUT_RDWR on a non-blocking socket waiting in select()/epoll() may not
   have the effect you expect.

2) A handler called inside the _recv_loop blocks - which is why I added a
   Timeout object. That said - the Timeout is *not* infallible; CPU-bound
   handlers will ignore it entirely.

3) The _recv_loop is trying to do send_event_to_observers() - and blocks
   in send_event(). In this case, yes, that Datapath is waiting to be
   able to queue further events onto the application's event queue -
   which means, in reality, that the events are being queued in the
   socket's receive queue (and are waiting to be read by recv()).

Accepting as many events as possible would seem a good idea - but what if
you've somehow got a 10Gbps flow generating PacketIns? You're going to
saturate the event queue eventually, or run the controller out of memory
with a dynamically growing queue. I know; I tried making a dynamically
growing event queue in my lab environment. ;)

>> +            try:
>> +                priority, (ev, state) = \
>> +                    self.events.get(timeout=self._event_get_timeout)
>
> You no longer need this timeout?
>

So - "in theory" this timeout is unnecessary. In practice, I think it's
good to have it as a defensive measure. It causes little harm; the get()
waits *up to the timeout* to receive an event, then times out. If it
didn't find an event before the timeout expired, it loops back around.
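Sketched in stdlib terms (with `queue.Empty` standing in for the hub's empty-queue exception, and `event_loop`, `dispatch`, and `is_active` as illustrative names rather than the actual Ryu code), that loop shape is roughly:

```python
import queue


def event_loop(events, dispatch, get_timeout=5.0, is_active=lambda: True):
    # Wake up at least every get_timeout seconds even if no events ever
    # arrive, so the loop always gets a chance to notice a shutdown
    # request (and, under eventlet, to yield to other greenlets).
    while is_active():
        try:
            priority, ev = events.get(timeout=get_timeout)
        except queue.Empty:
            continue  # timed out with nothing queued; re-check is_active()
        dispatch(priority, ev)
```

Without the timeout, a get() with no arriving events would park the loop indefinitely, and a request to stop would go unnoticed until the next event happened to show up.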
In effect, it ensures that there's a regular wakeup of the greenlet that
manages the event loop, and the overhead is low (a quick check for None,
and a return to waiting in get() - thereby allowing other greenlets to be
scheduled). I'd like to leave this in here, for safety's sake.

>> +            except Exception as e:
>> +                if ((isinstance(e, hub.Timeout)) and
>> +                        (e is handler_execution_timeout)):
>
> Have you tested this code?
>
> $ python
> >>> issubclass(eventlet.Timeout, Exception)
> False
> >>> issubclass(eventlet.Timeout, BaseException)
> True
>
> It seems that "except Exception" won't catch the Timeout exception.
>

Apparently not. I'm amenable to removing the Timeout on the handlers. I
was being excessively cautious there; furthermore, the fact that the
Timeouts do no real good with CPU-bound handlers means they have little
worth. I'll re-submit the patch without them; OK?

So - my plan:
1) Cut the number of semaphores to 2 (PacketIns and "everything else").
2) Increase the size of the priority queue to 128, but keep it static.
3) Remove the handler Timeouts, and the associated configuration option.

I'll follow this email up with a patch to do that; hopefully, that'll be
more acceptable still. ;)

Best,
Victor
--
Victor J. Orlikowski <> vjo@[cs.]duke.edu

------------------------------------------------------------------------------
_______________________________________________
Ryu-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ryu-devel
