So far my results are:

OVS 2.4.0: ODL configured with 2G of mem —> max is ~50 switches connected
OVS 2.3.1: ODL configured with 256MB of mem —> I currently have 150 switches connected, can’t scale more due to infra limits.
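In case it helps compare runs: I count “connected” switches straight from the operational inventory over RESTCONF. A minimal sketch, assuming the default RESTCONF port (8181) and admin/admin credentials rather than whatever your setup uses:

#!/usr/bin/env python
# Count switches currently present in ODL's operational inventory.
# Assumes the default RESTCONF port (8181) and default credentials (admin/admin).
import base64
import json
import urllib.request

ODL = "http://127.0.0.1:8181"  # controller address - adjust as needed
URL = ODL + "/restconf/operational/opendaylight-inventory:nodes"

req = urllib.request.Request(URL)
token = base64.b64encode(b"admin:admin").decode("ascii")
req.add_header("Authorization", "Basic " + token)
req.add_header("Accept", "application/json")

with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

nodes = data.get("nodes", {}).get("node", [])
switches = [n["id"] for n in nodes if n["id"].startswith("openflow:")]
print("%d switches connected" % len(switches))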
I will pursue my testing next week.

Thanks,
Alexis

> On Feb 19, 2016, at 5:06 PM, Abhijit Kumbhare <[email protected]> wrote:
>
> Interesting. I wonder - why would that be?
>
> On Fri, Feb 19, 2016 at 1:19 PM, Alexis de Talhouët <[email protected]> wrote:
> OVS 2.3.x scales fine
> OVS 2.4.x doesn’t scale well.
>
> Here is also the docker file for ovs 2.4.1
>
>> On Feb 19, 2016, at 11:20 AM, Alexis de Talhouët <[email protected]> wrote:
>>
>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>
>> Sure, attached is a tar file containing all the scripts / config / dockerfile I’m using to set up docker containers emulating OvS.
>> FYI: it’s ovs 2.3.0 and not 2.4.0 anymore.
>>
>> Also, forget about this whole mail thread; something in my private container must be breaking OVS behaviour, I don’t know what yet.
>>
>> With the docker file attached here, I can scale to 90+ without any trouble...
>>
>> Thanks,
>> Alexis
>>
>> <ovs_scalability_setup.tar.gz>
>>
>>> On Feb 18, 2016, at 6:07 PM, Jamo Luhrsen <[email protected]> wrote:
>>>
>>> inline...
>>>
>>> On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
>>>> I’m running OVS 2.4, against stable/lithium, openflowplugin-li
>>>
>>> so this is one difference between CSIT and your setup, in addition to the whole containers vs mininet thing.
>>>
>>>> I never scaled up to 1k, that was in the CSIT job.
>>>> In a real scenario, I scaled to ~400. But that was even before clustering came into play in ofp lithium.
>>>>
>>>> I think the logs I sent have trace for openflowplugin and openflowjava; if that’s not the case I can resubmit the logs.
>>>> I removed some of them in openflowjava because it was way too chatty (logging all message content between ovs <---> odl).
>>>>
>>>> Unfortunately those IOExceptions happen after the whole thing blows up. I was able to narrow down some logs in openflowjava to see the first disconnect event.
>>>> As mentioned in a previous mail (in this thread), it’s the device that is issuing the disconnect:
>>>>
>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | skipping bytebuf - too few bytes for header: 0 < 8
>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | not enough data
>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | DelegatingInboundHandler | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Channel inactive
>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg on [id: 0x1efab5fb, /172.18.0.49:36983 :> /192.168.1.159:6633]
>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg - DisconnectEvent
>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl | 205 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP
>>>>
>>>> Those logs come from another run, so they are not in the logs I sent earlier, although the behaviour is always the same.
>>>>
>>>> Regarding the memory, I don’t want to add more than 2G of memory because, and I tested it, the more memory I add, the more I can scale. But as you pointed out, this issue is not an OOM error. Thus I’d rather fail at 2G (fewer docker containers to spawn each run, ~50).
>>>
>>> so, maybe reduce your memory then to simplify the reproducing steps. Since you know that increasing memory lets you scale further but still hit the problem, let's make it easier to hit. how far can you go with the max mem set to 500M, if you are only loading ofp-li?
>>>
>>>> I definitely need some help here, because I can’t sort myself out in the openflowplugin + openflowjava codebase…
>>>> But I believe I already have Michal’s attention :)
>>>
>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>> I might be able to try and reproduce myself. I like breaking things :)
>>>
>>> JamO
>>>
>>>> Thanks,
>>>> Alexis
>>>>
>>>>> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>
>>>>> Alexis, don't worry about filing a bug just to give us a common place to work/comment, even if we close it later because of something outside of ODL. Email is fine too.
>>>>>
>>>>> what ovs version do you have in your containers? this test sounds great.
>>>>>
>>>>> Luis is right that if you were scaling well past 1k in the past but now it falls over at 50, it sounds like a bug.
>>>>>
>>>>> Oh, you can try increasing the jvm max_mem from the default of 2G just as a data point. The fact that you don't get OOMs makes me think memory might not be the final bottleneck.
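As an aside, since digging the first disconnect out of a multi-MB debug log by hand is tedious, something like the following throwaway helper does the narrowing for me. It is plain text matching on the messages quoted above, nothing ODL-specific:

#!/usr/bin/env python
# Print the earliest disconnect-related lines from karaf.log, so the first
# disconnect event doesn't have to be found by hand in a multi-MB debug log.
import sys

MARKERS = ("Channel inactive", "DisconnectEvent", "disconnecting: node=")

def first_disconnects(path, limit=20):
    hits = []
    with open(path, errors="replace") as log:
        for line in log:
            if any(m in line for m in MARKERS):
                hits.append(line.rstrip())
                if len(hits) >= limit:
                    break
    return hits

if __name__ == "__main__":
    for line in first_disconnects(sys.argv[1] if len(sys.argv) > 1 else "karaf.log"):
        print(line)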
>>>>> you could enable debug/trace logs in the right modules (need ofp devs to tell us which) for a little more info.
>>>>>
>>>>> I've seen those IOExceptions before and always assumed it was from an OF switch doing a hard RST on its connection.
>>>>>
>>>>> Thanks,
>>>>> JamO
>>>>>
>>>>> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>>>>>> If the same test worked 6-8 months ago this seems like a bug, but please feel free to open it whenever you are sure.
>>>>>>
>>>>>>> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>
>>>>>>> Hello Luis,
>>>>>>>
>>>>>>> For sure I’m willing to open a bug, but first I want to make sure there is a bug and that I’m not doing something wrong.
>>>>>>> In ODL’s infra, there is a test to find the maximum number of switches that can be connected to ODL, and this test reaches ~500 [0].
>>>>>>> I was able to scale up to 1090 switches [1] using the CSIT job in the sandbox.
>>>>>>> I believe the CSIT test is different in that the switches are emulated in one mininet VM, whereas I’m connecting OVS instances from separate containers.
>>>>>>>
>>>>>>> 6-8 months ago, I was able to perform the same test and scale with OVS docker containers up to ~400 before ODL started crashing (with some optimization done behind the scenes, i.e. ulimit, mem, cpu, GC…).
>>>>>>> Now I’m not able to scale past 100 with the same configuration.
>>>>>>>
>>>>>>> FYI: I just quickly looked at the CSIT test [0] karaf.log; it seems the test is actually failing but it is not correctly advertised… switch connections are dropped.
>>>>>>> Look for those:
>>>>>>> 2016-02-18 07:07:51,741 | WARN | entLoopGroup-6-6 | OFFrameDecoder | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>>>>>>> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>>>>>>> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>>>>>>> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>>>>>>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>>>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>>>>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>>>>>>> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>>>>>>> at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>>>>>>>
>>>>>>> [0]: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>>>>>>> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>
>>>>>>>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Alexis, thanks very much for sharing this test. Would you mind opening a bug with all this info so we can track this?
>>>>>>>>
>>>>>>>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Michal,
>>>>>>>>>
>>>>>>>>> ODL memory is capped at 2GB; the more memory I add, the more OVS I can connect. Regarding CPU, it’s around 10-20% when connecting new OVS, with some peaks at 80%.
>>>>>>>>>
>>>>>>>>> After some investigation, here is what I observed:
>>>>>>>>> Let’s say I have 50 switches connected, stat manager disabled. I have one open socket per switch, plus an additional one for the controller.
>>>>>>>>> Then I connect a new switch (2016-02-18 09:35:08,059), 51 switches… something happens causing all connections to be dropped (by the device?)
>>>>>>>>> and then ODL tries to recreate them and goes into a crazy loop where it is never able to re-establish communication, but keeps creating new sockets.
>>>>>>>>> I suspect something is being garbage collected due to lack of memory, although there are no OOM errors.
>>>>>>>>>
>>>>>>>>> Attached are the YourKit Java Profiler analysis for the described scenario and the logs [1].
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Alexis
>>>>>>>>>
>>>>>>>>> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>>>>>>>
>>>>>>>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Alexis,
>>>>>>>>>> I am not sure how OVS uses threads - in the changelog there are some concurrency-related improvements in 2.1.3 and 2.3.
>>>>>>>>>> Also I guess docker can be constrained regarding assigned resources.
>>>>>>>>>>
>>>>>>>>>> For you the most important thing is the number of cores used by the controller.
>>>>>>>>>>
>>>>>>>>>> What do your CPU and memory consumption look like when you connect all the OVSs?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Michal
>>>>>>>>>>
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Alexis de Talhouët <[email protected]>
>>>>>>>>>> Sent: Tuesday, February 9, 2016 14:44
>>>>>>>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>>>>>>>> Cc: [email protected]
>>>>>>>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>>>>>>>>
>>>>>>>>>> Hello Michal,
>>>>>>>>>>
>>>>>>>>>> Yes, all the OvS instances I’m running have a unique DPID.
>>>>>>>>>>
>>>>>>>>>> Regarding the thread limit for netty, I’m running the tests on a server that has 28 CPU(s).
>>>>>>>>>>
>>>>>>>>>> Is each OvS instance assigned its own thread?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Alexis
>>>>>>>>>>
>>>>>>>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>> in the Li design the stats manager is not a standalone app but part of the core of ofPlugin. You can disable it via rpc.
>>>>>>>>>>>
>>>>>>>>>>> Just a question regarding your ovs setup. Are all your DPIDs unique?
>>>>>>>>>>>
>>>>>>>>>>> Also there is a limit for netty in the form of the number of threads it uses. By default it uses 2 x the number of CPU cores. You should have as many cores as possible in order to get max performance.
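To make the netty point concrete, the default event loop sizing is just arithmetic on the core count (assuming the stock netty default is in effect and nothing overrides it):

# Rough arithmetic behind the note above: netty's default event loop group size
# is 2 x the number of available cores (assuming the stock default is in effect).
import os

cores = os.cpu_count() or 1          # e.g. 28 on the test server mentioned above
netty_threads = 2 * cores            # netty default: 2 x cpu cores
print("cores=%d -> default netty event loop threads=%d" % (cores, netty_threads))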
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Michal
>>>>>>>>>>>
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: [email protected] on behalf of Alexis de Talhouët <[email protected]>
>>>>>>>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>>>>>>>> To: [email protected]
>>>>>>>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>
>>>>>>>>>>> Hello openflowplugin-dev,
>>>>>>>>>>>
>>>>>>>>>>> I’m currently running some scalability tests against the openflowplugin-li plugin, stable/lithium.
>>>>>>>>>>> Playing with the CSIT job, I was able to connect up to 1090 switches: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>
>>>>>>>>>>> I’m now running the test against 40 OvS switches, each one of them in a docker container.
>>>>>>>>>>>
>>>>>>>>>>> Connecting around 30 of them works fine, but then adding a new one completely breaks ODL; it goes crazy and unresponsive.
>>>>>>>>>>> Attached is a snippet of the karaf.log with the log level set to DEBUG for org.opendaylight.openflowplugin, so it’s a really big log (~2.5MB).
>>>>>>>>>>>
>>>>>>>>>>> Here is what I observed based on the log:
>>>>>>>>>>> I have 30 switches connected, all works fine. Then I add a new one:
>>>>>>>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>>>>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>>>>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>>>>>>>> - Creation of the transaction chain, …
>>>>>>>>>>>
>>>>>>>>>>> Then it all starts falling apart with this log:
>>>>>>>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>>>>>>>> And then ConnectionContextImpl disconnects the switches one by one, RpcManagerImpl is unregistered, and it goes crazy for a while.
>>>>>>>>>>> But all I’ve done is add a new switch...
>>>>>>>>>>>
>>>>>>>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>>>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>>>>>>>>> And this goes on for a while.
>>>>>>>>>>>
>>>>>>>>>>> Do you have any input on this?
>>>>>>>>>>>
>>>>>>>>>>> Could you give some advice on how to scale? (I know disabling the StatisticsManager can help, for instance.)
>>>>>>>>>>>
>>>>>>>>>>> Am I doing something wrong?
>>>>>>>>>>>
>>>>>>>>>>> I can provide any requested information regarding the issue I’m facing.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Alexis
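For anyone who wants to reproduce without the tarball attached earlier in the thread, here is a rough sketch of the kind of bring-up/tear-down helper being discussed. The image name and bridge name are assumptions, not the actual attached scripts; only `docker` and `ovs-vsctl set-controller` are taken as given:

#!/usr/bin/env python
# Rough bring-up/tear-down helper for N OVS-in-docker switches pointing at ODL.
# NOT the attached tarball: the image name ("ovs-scalability") and bridge name
# ("br0") are assumptions; the image is expected to start ovsdb-server /
# ovs-vswitchd and create br0 on its own.
import subprocess
import sys

IMAGE = "ovs-scalability"               # hypothetical docker image running OVS
BRIDGE = "br0"                          # bridge assumed to exist inside the container
CONTROLLER = "tcp:192.168.1.159:6633"   # ODL OpenFlow endpoint seen in the logs above

def up(count):
    for i in range(count):
        name = "ovs%d" % i
        subprocess.check_call(["docker", "run", "-d", "--privileged",
                               "--name", name, IMAGE])
        # Point the switch at the controller; each container gets its own OVS,
        # so DPIDs stay unique as long as the image randomizes them.
        subprocess.check_call(["docker", "exec", name,
                               "ovs-vsctl", "set-controller", BRIDGE, CONTROLLER])

def down(count):
    for i in range(count):
        subprocess.call(["docker", "rm", "-f", "ovs%d" % i])

if __name__ == "__main__":
    action, n = sys.argv[1], int(sys.argv[2])   # e.g. "up 50" or "down 50"
    up(n) if action == "up" else down(n)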
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
