Thanks, this is enough for now. I raised the priority to critical so it can be fixed in the next Be SR.
BR/Luis

> On Mar 15, 2016, at 11:29 AM, Alexis de Talhouët <[email protected]> wrote:
>
> I did, and started something in int/test but haven't had the time to finish it.
>
> https://bugs.opendaylight.org/show_bug.cgi?id=5464
> https://git.opendaylight.org/gerrit/#/c/35813/
>
> I agree with the serious problem with ovs2.4, but right now I'm trying to solve an FD leak in netconf :)
>
> Thanks,
> Alexis
>
>> On Mar 15, 2016, at 2:27 PM, Luis Gomez <[email protected]> wrote:
>>
>> Alexis, did you open a bug with all the information for this? We are releasing Be SR1 and I believe we still have serious perf issues with OVS 2.4.
>>
>> BR/Luis
>>
>>> On Mar 4, 2016, at 4:56 PM, Jamo Luhrsen <[email protected]> wrote:
>>>
>>> Alexis,
>>>
>>> thanks for the bug and the patch, and keep up the good work digging at openflowplugin.
>>>
>>> JamO
>>>
>>> On 03/04/2016 07:38 AM, Alexis de Talhouët wrote:
>>>> JamO,
>>>>
>>>> Here is the bug: https://bugs.opendaylight.org/show_bug.cgi?id=5464
>>>> Here is the patch in int/test: https://git.opendaylight.org/gerrit/#/c/35813/
>>>> It is still WIP. And yes, I believe we should have a CSIT job running the test.
>>>>
>>>> Thanks,
>>>> Alexis
>>>>
>>>>> On Mar 3, 2016, at 12:41 AM, Jamo Luhrsen <[email protected]> wrote:
>>>>>
>>>>> On 02/19/2016 02:10 PM, Alexis de Talhouët wrote:
>>>>>> So far my results are:
>>>>>>
>>>>>> OVS 2.4.0: ODL configured with 2G of mem -> max is ~50 switches connected
>>>>>> OVS 2.3.1: ODL configured with 256MB of mem -> I currently have 150 switches connected, can't scale more due to infra limits.
>>>>>
>>>>> Alexis, I think this is probably worth putting a bugzilla up.
>>>>>
>>>>> How much horsepower do you need per docker ovs instance? We need to get this automated in CSIT. Marcus from ovsdb wants to do similar tests with ovsdb.
>>>>>
>>>>> JamO
>>>>>
>>>>>> I will pursue my testing next week.
>>>>>>
>>>>>> Thanks,
>>>>>> Alexis
>>>>>>
>>>>>>> On Feb 19, 2016, at 5:06 PM, Abhijit Kumbhare <[email protected]> wrote:
>>>>>>>
>>>>>>> Interesting. I wonder why that would be?
>>>>>>>
>>>>>>> On Fri, Feb 19, 2016 at 1:19 PM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>
>>>>>>> OVS 2.3.x scales fine.
>>>>>>> OVS 2.4.x doesn't scale well.
>>>>>>>
>>>>>>> Here is also the docker file for ovs 2.4.1
>>>>>>>
>>>>>>>> On Feb 19, 2016, at 11:20 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>>>>>>>
>>>>>>>> Sure, attached is a tar file containing all the scripts / config / dockerfile I'm using to set up docker containers emulating OvS.
>>>>>>>> FYI: it's ovs 2.3.0 and not 2.4.0 anymore.
>>>>>>>>
>>>>>>>> Also, forget about this whole mail thread; something in my private container must be breaking OVS behaviour, I don't know what yet.
>>>>>>>>
>>>>>>>> With the docker file attached here, I can scale to 90+ without any trouble...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Alexis
>>>>>>>>
>>>>>>>> <ovs_scalability_setup.tar.gz>
>>>>>>>>
>>>>>>>>> On Feb 18, 2016, at 6:07 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> inline...
>>>>>>>>>
>>>>>>>>> On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
>>>>>>>>>> I'm running OVS 2.4, against stable/lithium, openflowplugin-li
>>>>>>>>>
>>>>>>>>> so this is one difference between CSIT and your setup, in addition to the whole containers vs mininet.
>>>>>>>>>
>>>>>>>>>> I never scaled up to 1k, this was in the CSIT job.
>>>>>>>>>> In a real scenario, I scaled to ~400. But that was even before clustering came into play in ofp lithium.
>>>>>>>>>>
>>>>>>>>>> I think the logs I sent have trace logging for openflowplugin and openflowjava; if that's not the case I can resubmit the logs.
>>>>>>>>>> I removed some of the openflowjava logging because it was way too chatty (logging the content of every message between ovs <---> odl).
>>>>>>>>>>
>>>>>>>>>> Unfortunately those IOExceptions happen after the whole thing blows up. I was able to narrow down some logs in openflowjava to see the first disconnect event.
>>>>>>>>>> As mentioned in a previous mail (in this mail thread), it's the device that is issuing the disconnect:
>>>>>>>>>>
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | skipping bytebuf - too few bytes for header: 0 < 8
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | not enough data
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | DelegatingInboundHandler | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Channel inactive
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg on [id: 0x1efab5fb, /172.18.0.49:36983 :> /192.168.1.159:6633]
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg - DisconnectEvent
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl | 205 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP
>>>>>>>>>>
>>>>>>>>>> Those logs come from another run, so they are not in the logs I sent earlier, although the behaviour is always the same.
>>>>>>>>>>
>>>>>>>>>> Regarding the memory, I don't want to add more than 2G because, and I tested it, the more memory I add, the more I can scale. But as you pointed out, this issue is not an OOM error. So I would rather fail at 2G (fewer docker containers to spawn each run, ~50).
>>>>>>>>>
>>>>>>>>> so, maybe reduce your memory then to simplify the reproducing steps. Since you know that increasing memory allows you to scale further but still hit the problem, let's make it easier to hit. how far can you go with the max mem set to 500M, if you are only loading ofp-li?
>>>>>>>>>
>>>>>>>>>> I definitely need some help here, because I can't sort myself out in the openflowplugin + openflowjava codebase… But I believe I already have Michal's attention :)
>>>>>>>>>
>>>>>>>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>>>>>>>> I might be able to try and reproduce myself. I like breaking things :)
>>>>>>>>>
>>>>>>>>> JamO
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Alexis
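
For reference, here is a minimal sketch of the kind of bring-up/tear-down tooling being asked about above. It is not the content of the attached tarball: it assumes a hypothetical local image named "ovs-node" that starts the OVS daemons on boot, and a placeholder controller address.

#!/usr/bin/env python
"""Sketch of an OVS-in-docker bring-up/tear-down helper.

Assumptions (not from the thread): a hypothetical "ovs-node" image that
runs ovsdb-server/ovs-vswitchd, and a reachable controller at CONTROLLER.
"""
import subprocess
import sys

CONTROLLER = "192.168.1.159:6633"  # placeholder controller address
IMAGE = "ovs-node"                 # hypothetical image name


def up(count):
    """Start `count` containers, each emulating one OpenFlow switch."""
    for i in range(count):
        name = "ovs%d" % i
        subprocess.check_call(["docker", "run", "-d", "--privileged",
                               "--name", name, IMAGE])
        # create a bridge in the container and point it at the controller
        subprocess.check_call(["docker", "exec", name,
                               "ovs-vsctl", "add-br", "br0"])
        subprocess.check_call(["docker", "exec", name,
                               "ovs-vsctl", "set-controller", "br0",
                               "tcp:" + CONTROLLER])


def down(count):
    """Tear the containers back down."""
    for i in range(count):
        subprocess.call(["docker", "rm", "-f", "ovs%d" % i])


if __name__ == "__main__":
    action, n = sys.argv[1], int(sys.argv[2])
    up(n) if action == "up" else down(n)

Run it as "python ovs_containers.py up 50" and "python ovs_containers.py down 50"; per-container resource caps (the "horsepower per docker ovs instance" question above) could be added with docker run's --memory and --cpuset-cpus flags.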
>>>>>>>>>>> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Alexis, don't worry about filing a bug just to give us a common place to work/comment, even if we close it later because of something outside of ODL. Email is fine too.
>>>>>>>>>>>
>>>>>>>>>>> what ovs version do you have in your containers? this test sounds great.
>>>>>>>>>>>
>>>>>>>>>>> Luis is right that if you were scaling well past 1k in the past, but now it falls over at 50, it sounds like a bug.
>>>>>>>>>>>
>>>>>>>>>>> Oh, you can try increasing the jvm max_mem from the default of 2G just as a data point. The fact that you don't get OOMs makes me think memory might not be the final bottleneck.
>>>>>>>>>>>
>>>>>>>>>>> you could enable debug/trace logs in the right modules (need ofp devs to tell us that) for a little more info.
>>>>>>>>>>>
>>>>>>>>>>> I've seen those IOExceptions before and always assumed it was from an OF switch doing a hard RST on its connection.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> JamO
>>>>>>>>>>>
>>>>>>>>>>> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>>>>>>>>>>>> If the same test worked 6-8 months ago this seems like a bug, but please feel free to open it whenever you are sure.
>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Luis,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For sure I'm willing to open a bug, but first I want to make sure there is a bug and that I'm not doing something wrong.
>>>>>>>>>>>>> In ODL's infra, there is a test to find the maximum number of switches that can be connected to ODL, and this test reaches ~500 [0].
>>>>>>>>>>>>> I was able to scale up to 1090 switches [1] using the CSIT job in the sandbox.
>>>>>>>>>>>>> I believe the CSIT test is different in that the switches are emulated in one mininet VM, whereas I'm connecting OVS instances from separate containers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 6-8 months ago, I was able to perform the same test and scale with OVS docker containers up to ~400 before ODL started crashing (with some optimization done behind the scenes, i.e. ulimit, mem, cpu, GC…).
>>>>>>>>>>>>> Now I'm not able to scale beyond 100 with the same configuration.
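
For contrast with the container-per-switch approach, the CSIT-style setup described above (all switches emulated inside one mininet VM, pointed at a remote controller) looks roughly like the sketch below. It assumes mininet is installed on the test VM and uses a placeholder controller IP; it is not the actual CSIT suite.

#!/usr/bin/env python
"""Rough sketch of a single-VM mininet scale test (placeholder values,
not the actual CSIT suite)."""
from mininet.cli import CLI
from mininet.net import Mininet
from mininet.node import OVSSwitch, RemoteController
from mininet.topo import LinearTopo

ODL_IP = "192.168.1.159"   # placeholder controller address
SWITCHES = 50              # number of emulated switches

# One mininet process emulates every switch and points them all at ODL.
net = Mininet(topo=LinearTopo(k=SWITCHES), switch=OVSSwitch,
              controller=None, build=False)
net.addController("c0", controller=RemoteController, ip=ODL_IP, port=6633)
net.build()
net.start()
CLI(net)      # observe / interact until "exit"
net.stop()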
>>>>>>>>>>>>>
>>>>>>>>>>>>> FYI: I just quickly looked at the CSIT test [0] karaf.log, and it seems the test is actually failing, but that is not correctly advertised… switch connections are dropped.
>>>>>>>>>>>>> Look for these:
>>>>>>>>>>>>> 2016-02-18 07:07:51,741 | WARN | entLoopGroup-6-6 | OFFrameDecoder | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
>>>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>>>>>>>>>>>>> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [0]: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>>>>>>>>>>>>> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alexis, thanks very much for sharing this test. Would you mind opening a bug with all this info so we can track this?
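
Since the CSIT job can report success while connections are silently being dropped, a quick cross-check is to count the failure signatures quoted in this thread directly in karaf.log. A minimal sketch (the patterns are just the strings from the excerpts above):

#!/usr/bin/env python
"""Count disconnect signatures from this thread in a karaf.log."""
import sys
from collections import Counter

PATTERNS = [
    "java.io.IOException: Connection reset by peer",
    "ConsumeIntern msg - DisconnectEvent",
    "connection state = RIP",
]


def scan(path):
    counts = Counter()
    with open(path) as log:
        for line in log:
            for pattern in PATTERNS:
                if pattern in line:
                    counts[pattern] += 1
    return counts


if __name__ == "__main__":
    # e.g. python scan_karaf.py data/log/karaf.log
    for pattern, count in sorted(scan(sys.argv[1]).items()):
        print("%6d  %s" % (count, pattern))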
>>>>>>>>>>>>>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Michal,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ODL's memory is capped at 2GB; the more memory I add, the more OVS I can connect. Regarding CPU, it's around 10-20% when connecting new OVS, with some peaks at 80%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After some investigation, here is what I observed:
>>>>>>>>>>>>>>> Let's say I have 50 switches connected, stats manager disabled. I have one open socket per switch, plus an additional one for the controller.
>>>>>>>>>>>>>>> Then I connect a new switch (2016-02-18 09:35:08,059), 51 switches… something happens that causes all connections to be dropped (by the device?), and then ODL tries to recreate them and goes into a crazy loop where it is never able to re-establish communication, but keeps creating new sockets.
>>>>>>>>>>>>>>> I'm suspecting something is being garbage collected due to lack of memory, although there are no OOM errors.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Attached are the YourKit Java Profiler analysis for the described scenario and the logs [1].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Alexis
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>>>> I am not sure how OVS uses threads - in the changelog there are some concurrency-related improvements in 2.1.3 and 2.3.
>>>>>>>>>>>>>>>> Also, I guess docker can be constrained regarding assigned resources.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For you, the most important thing is the number of cores used by the controller.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do your cpu and memory consumption look like when you connect all the OVSs?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 14:44
>>>>>>>>>>>>>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>>>>>>>>>>>>>> Cc: [email protected]
>>>>>>>>>>>>>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Michal,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, all the OvS instances I'm running have a unique DPID.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regarding the thread limit for netty, I'm running the test on a server that has 28 CPUs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is each OvS instance assigned its own thread?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Alexis
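
As a quick sanity check on the netty sizing rule Michal mentions below (event-loop threads default to 2 x the core count), the arithmetic for this box is trivial; sketched here in Python rather than ODL's actual Java code:

import multiprocessing

# netty's default event-loop sizing as described below: 2 x available cores
cores = multiprocessing.cpu_count()   # 28 on the server described above
netty_default_threads = 2 * cores     # 2 x 28 = 56 I/O threads
print(netty_default_threads)

With roughly 56 event-loop threads multiplexing ~50 switch connections, the netty pool itself is probably not the limiting factor at this scale, which is consistent with the suspicion elsewhere in the thread that memory is not the bottleneck either.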
>>>>>>>>>>>>>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>>>>> in the Li design the stats manager is not a standalone app but part of the core of ofPlugin. You can disable it via rpc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just a question regarding your ovs setup. Do you have all DPIDs unique?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, there is a limit for netty in the form of the number of threads it uses. By default it uses 2 x cpu_cores_amount. You should have as many cores as possible in order to get max performance.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>> From: [email protected] on behalf of Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello openflowplugin-dev,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm currently running some scalability tests against the openflowplugin-li plugin, stable/lithium.
>>>>>>>>>>>>>>>>> Playing with the CSIT job, I was able to connect up to 1090 switches: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm now running the test against 40 OvS switches, each one of them in a docker container.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Connecting around 30 of them works fine, but then adding a new one completely breaks ODL; it goes crazy and becomes unresponsive.
>>>>>>>>>>>>>>>>> Attached is a snippet of the karaf.log with the log level set to DEBUG for org.opendaylight.openflowplugin, thus it's a really big log (~2.5MB).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here is what I observed based on the log:
>>>>>>>>>>>>>>>>> I have 30 switches connected, all works fine.
>>>>>>>>>>>>>>>>> Then I add a new one:
>>>>>>>>>>>>>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>>>>>>>>>>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>>>>>>>>>>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>>>>>>>>>>>>>> - Creation of the transaction chain, …
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then it all starts falling apart with this log:
>>>>>>>>>>>>>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>>>>>>>>>>>>>> And then ConnectionContextImpl disconnects the switches one by one, RpcManagerImpl is unregistered, and it goes crazy for a while.
>>>>>>>>>>>>>>>>> But all I've done is add a new switch…
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>>>>>>>>>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>>>>>>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>>>>>>>>>>>>>>> And this goes on for a while.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you have any input on this?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Could you give some advice on how to scale? (I know disabling the StatisticsManager can help, for instance.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am I doing something wrong?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can provide any requested information regarding the issue I'm facing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Alexis
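
On the "disable the statistics manager via RPC" point raised in this thread, the call would be made over RESTCONF roughly as sketched below. The module and RPC names are placeholders (check the statistics-control RPC actually exposed by the openflowplugin build you are running), and the script assumes the `requests` library, default admin/admin credentials, and RESTCONF on port 8181.

#!/usr/bin/env python
"""Sketch of invoking an openflowplugin statistics-control RPC over RESTCONF.

Placeholders: the <module>:<rpc-name> path, credentials, port, and any
required input body all need to match the controller you are running.
"""
import requests

ODL = "http://127.0.0.1:8181"
RPC = "/restconf/operations/<module>:<rpc-name>"   # placeholder path

resp = requests.post(ODL + RPC,
                     json={"input": {}},           # RPC input, if the RPC takes one
                     auth=("admin", "admin"))
print(resp.status_code)
print(resp.text)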
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
