Dear Ryu developer team,

I am using a Ryu SDN controller on top of Mininet to deploy and test an active 
loss monitoring system intended for datacenter topologies. Everything works 
quite nicely for smaller networks, but we want data that is closer to 
datacenter scale, so we aimed for a topology of at least the size of 
FatTree(16).

Concretely, that means (a sketch of how the topology is built follows below):

- 320 distinct switches
- 2048 links between those switches
- 190 hosts attached to switches
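
For context, the topology is built with Mininet's Python API roughly as sketched below (a condensed standard k-ary fat-tree; the class name and the host loop are illustrative, and my actual script attaches only the 190 hosts rather than the full k^3/4). It is then started with something like Mininet(topo=FatTree(), switch=OVSSwitch, controller=RemoteController).

from itertools import count

from mininet.topo import Topo


class FatTree(Topo):
    """Standard k-ary fat-tree: (5/4)*k^2 switches and k^3/4 host ports."""

    def build(self, k=16):
        half = k // 2
        sid = count(1)  # sequential switch names -> unique default DPIDs in Mininet

        def switch():
            return self.addSwitch('s%d' % next(sid))

        cores = [switch() for _ in range(half * half)]       # 64 core switches
        host_id = 0
        for pod in range(k):                                  # 16 pods
            aggs = [switch() for _ in range(half)]            # 8 aggregation per pod
            edges = [switch() for _ in range(half)]           # 8 edge per pod
            for i, agg in enumerate(aggs):
                for j in range(half):                         # uplinks to one core group
                    self.addLink(agg, cores[i * half + j])
                for edge in edges:                            # full agg-edge mesh in the pod
                    self.addLink(agg, edge)
            for edge in edges:
                for _ in range(half):                         # host ports per edge switch
                    host_id += 1
                    self.addLink(edge, self.addHost('h%d' % host_id))

For k=16 this yields the 320 switches and 2048 switch-to-switch links mentioned above.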

I have the necessary resources on my university server: when monitoring with 
htop, CPU and memory load are only rarely maxed out (briefly during startup or 
while postprocessing).

However, I am consistently observing in the Ryu logs that datapaths are 
disconnecting, i.e. entries like this:

unregister Switch<dpid=294, Port<dpid=294, port_no=1, LIVE> Port<dpid=294, 
port_no=3, LIVE> Port<dpid=294, port_no=4, LIVE> Port<dpid=294, port_no=5, 
LIVE> Port<dpid=294, port_no=6, LIVE> Port<dpid=294, port_no=7, LIVE> 
Port<dpid=294, port_no=8, LIVE> Port<dpid=294, port_no=9, LIVE> Port<dpid=294, 
port_no=10, LIVE> Port<dpid=294, port_no=11, LIVE> Port<dpid=294, port_no=12, 
LIVE> Port<dpid=294, port_no=13, LIVE> Port<dpid=294, port_no=14, LIVE> 
Port<dpid=294, port_no=15, LIVE> Port<dpid=294, port_no=16, LIVE> 
Port<dpid=294, port_no=17, LIVE> Port<dpid=294, port_no=18, LIVE> >

I have also seen switches disconnect for smaller network sizes, but there it 
usually happens at a later stage of my run, which causes fewer problems. With 
the big network, these events happen during startup. If I push flows from the 
controller while some of the switches are not properly connected, some of the 
flows do not get installed, which leads to black holes in my system. This is 
usually accompanied by log entries such as:

Datapath in process of terminating; send() to ('127.0.0.1', 40006) discarded.
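
As a workaround on my side I could gate the flow installation on datapath liveness, roughly along these lines (a simplified sketch, not my full application; EXPECTED_SWITCHES and the class name are placeholders):

from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, DEAD_DISPATCHER, set_ev_cls

EXPECTED_SWITCHES = 320  # placeholder: number of switches in FatTree(16)


class LivenessTracker(app_manager.RyuApp):
    """Keep track of which datapaths are currently usable before pushing flows."""

    def __init__(self, *args, **kwargs):
        super(LivenessTracker, self).__init__(*args, **kwargs)
        self.live_dps = {}

    @set_ev_cls(ofp_event.EventOFPStateChange, [MAIN_DISPATCHER, DEAD_DISPATCHER])
    def _state_change_handler(self, ev):
        dp = ev.datapath
        if ev.state == MAIN_DISPATCHER:
            self.live_dps[dp.id] = dp
            if len(self.live_dps) == EXPECTED_SWITCHES:
                self.logger.info('all %d datapaths connected, safe to install flows',
                                 EXPECTED_SWITCHES)
        elif ev.state == DEAD_DISPATCHER and dp.id is not None:
            self.live_dps.pop(dp.id, None)
            self.logger.warning('lost datapath %s; its flows may need to be re-pushed',
                                dp.id)

But even with such a guard the disconnects themselves remain, so I would like to understand why they happen at all.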

My best guess as to why this is happening is that the operating system might be 
overloaded by all the context switches needed to keep the emulation running 
smoothly, and that some keepalive timer somewhere expires because one of the 
connected processes was starved of CPU time. But of course I might also have 
made a mistake somewhere in my implementation.

Since my goal is not to model a dynamically changing topology or complete 
switch failures, there is no reason for this behavior to be part of my 
emulation. So I tried to find any timers in the code that could be causing 
these unwanted disconnects and to deactivate them or set them to a very long 
interval:

In Ryu, in controller.py:

- I set the default value of 'echo-request-interval' to 604800
- I set 'maximum-unreplied-echo-requests' to 10
- I set the socket timeout to None in the __init__ method of the Datapath class

In Mininet:

- I set the default value of 'reconnectms' to zero in the __init__ method of the 
OVSSwitch class, in the file node.py

In OVS:

- I set probe_interval to zero in the __init__ method of the Reconnect class, in 
the file reconnect.py, to disable the keepalive feature
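
For completeness, the Ryu-side edits above should be roughly equivalent to overriding the corresponding configuration options at startup; the sketch below expresses them as a small standalone app (illustrative only, assuming the options keep their usual underscored names internally, and using a large finite socket timeout instead of None). I mention it mainly to make the exact values I am using explicit.

from ryu.base import app_manager
from ryu import cfg
from ryu.controller import controller  # noqa: ensures the options below are registered


class RelaxedTimeouts(app_manager.RyuApp):
    """Relax the OpenFlow keepalive-related settings instead of patching controller.py."""

    def __init__(self, *args, **kwargs):
        super(RelaxedTimeouts, self).__init__(*args, **kwargs)
        conf = cfg.CONF
        # one week between echo requests, effectively disabling the keepalive
        conf.set_override('echo_request_interval', 604800)
        # tolerate several unreplied echo requests before disconnecting
        conf.set_override('maximum_unreplied_echo_requests', 10)
        # generous socket timeout instead of setting it to None in Datapath.__init__
        conf.set_override('socket_timeout', 600)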

Unfortunately, none of this stopped the switches from disconnecting.
Could someone more familiar with the code point me in the right direction for 
further investigation? Did I miss a timer somewhere, or do you have a different 
explanation for this behavior and an idea how to stop it?

Thank you for your assistance.

Christelle Gloor