Dear Ryu developer team,

I am using a Ryu SDN controller on top of Mininet to deploy and test an active loss monitoring system intended for datacenter topologies. Everything works quite nicely for smaller networks, but we want data that is closer to datacenter scale, so we aimed for a topology at least the size of FatTree(16).
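(For the record, the switch and inter-switch link counts I quote below follow from the standard k-ary fat-tree formulas; a quick sanity check in Python, where the variable names are just my own shorthand:)

    # Sizing of a k-ary fat-tree for k = 16 (standard formulas:
    # (k/2)^2 core switches, k^2/2 aggregation, k^2/2 edge,
    # and k/2 uplinks on every edge and aggregation switch).
    k = 16
    core = (k // 2) ** 2                             # 64
    aggregation = edge = k * k // 2                  # 128 each
    switches = core + aggregation + edge             # 320
    switch_links = (edge + aggregation) * (k // 2)   # 2048 inter-switch links
    print(switches, switch_links)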
Concretely, that means:

- 320 distinct switches
- 2048 links between those switches
- 190 hosts attached to switches

I have the necessary resources on my university server; according to htop, CPU and memory load are only rarely maxed out (briefly during startup or while post-processing). Nevertheless, I consistently see in the Ryu logs that datapaths are disconnecting, i.e. entries like this:

    unregister Switch<dpid=294, Port<dpid=294, port_no=1, LIVE> Port<dpid=294, port_no=3, LIVE> Port<dpid=294, port_no=4, LIVE> Port<dpid=294, port_no=5, LIVE> Port<dpid=294, port_no=6, LIVE> Port<dpid=294, port_no=7, LIVE> Port<dpid=294, port_no=8, LIVE> Port<dpid=294, port_no=9, LIVE> Port<dpid=294, port_no=10, LIVE> Port<dpid=294, port_no=11, LIVE> Port<dpid=294, port_no=12, LIVE> Port<dpid=294, port_no=13, LIVE> Port<dpid=294, port_no=14, LIVE> Port<dpid=294, port_no=15, LIVE> Port<dpid=294, port_no=16, LIVE> Port<dpid=294, port_no=17, LIVE> Port<dpid=294, port_no=18, LIVE> >

I have also seen switches disconnect in smaller networks, but there it usually happens at a later stage of my code and therefore causes fewer problems. With the big network, these events happen during startup. If I push flows from the controller while some switches are not properly connected, some of the flows are not installed, which leads to black holes in my system. This is usually accompanied by log entries such as:

    Datapath in process of terminating; send() to ('127.0.0.1', 40006) discarded.

My best guess as to why this happens is that the operating system is overloaded by all the context switches needed to keep the emulation running smoothly, and that some keepalive timer somewhere expires because one of the connecting processes was starved of CPU time. Of course, I might also have made a mistake somewhere in my implementation. Since my goal is not to model a dynamically changing topology or complete switch failures, there is no reason for this behavior to be part of my emulation. I therefore tried to find any timers in the code that could cause these unwanted disconnects and to deactivate them or set them to a very high interval:

In Ryu (controller.py):

- I set the default value of 'echo-request-interval' to 604800.
- I set 'maximum-unreplied-echo-requests' to 10.
- I set the socket timeout to None in the __init__ method of the Datapath class.

In Mininet (node.py):

- I set the default value of 'reconnectms' to zero in the __init__ method of the OVSSwitch class.

In OVS (reconnect.py):

- I set probe_interval to zero in the __init__ method of the Reconnect class to disable the keepalive feature.

(I have appended abbreviated excerpts of these edits after my signature.)

Unfortunately, none of this stopped the switches from disconnecting. Could anyone more familiar with the code point me in the right direction for further investigation? Did I miss a timer somewhere, or do you have a different explanation for this behavior and how to stop it?

Thank you for your assistance.

Christelle Gloor
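P.S. As mentioned above, here is approximately what those edits look like. These are abbreviated, hand-copied excerpts rather than literal diffs: the surrounding code is omitted, and the registration helpers, base classes, other keyword arguments and help strings are reproduced from memory of the versions installed on my server, so they may not match your tree line for line.

    # --- ryu/controller/controller.py (excerpt of my edits) ---
    CONF.register_opts([
        # ... other options unchanged ...
        cfg.FloatOpt('echo-request-interval', default=604800,   # my edit: one week, in seconds
                     help='Time, in seconds, between sending echo requests to a datapath.'),
        cfg.IntOpt('maximum-unreplied-echo-requests', default=10,   # my edit
                   help='Maximum number of unreplied echo requests before datapath is disconnected.')
    ])

    class Datapath(ofproto_protocol.ProtocolDesc):
        def __init__(self, socket, address):
            super(Datapath, self).__init__()
            self.socket = socket
            self.socket.settimeout(None)   # my edit: no socket timeout (was CONF.socket_timeout)
            # ... rest of __init__ unchanged ...

    # --- mininet/node.py, class OVSSwitch (excerpt of my edit) ---
    class OVSSwitch( Switch ):
        def __init__( self, name, failMode='secure', datapath='kernel',
                      reconnectms=0,   # my edit: default set to zero (milliseconds)
                      **params ):
            # ... other keyword arguments and the body unchanged ...

    # --- python/ovs/reconnect.py, class Reconnect (excerpt of my edit) ---
    class Reconnect(object):
        def __init__(self, now):
            # ...
            self.probe_interval = 0   # my edit: disable the keepalive probe
            # ...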
_______________________________________________
Ryu-devel mailing list
Ryu-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ryu-devel