Interesting. I wonder why that would be?

On Fri, Feb 19, 2016 at 1:19 PM, Alexis de Talhouët <[email protected]> wrote:
> OVS 2.3.x scales fine.
> OVS 2.4.x doesn’t scale well.
>
> Here is also the docker file for ovs 2.4.1
>
> On Feb 19, 2016, at 11:20 AM, Alexis de Talhouët <[email protected]> wrote:
>
> can I use your containers? do you have any scripts/tools to bring things up/down?
>
> Sure, attached a tar file containing all the scripts / config / dockerfile I’m using to set up docker containers emulating OvS.
> FYI: it’s ovs 2.3.0 and not 2.4.0 anymore.
>
> Also, forget about this whole mail thread, something in my private container must be breaking OVS behaviour, I don’t know what yet.
>
> With the docker file attached here, I can scale to 90+ without any trouble...
>
> Thanks,
> Alexis
>
> <ovs_scalability_setup.tar.gz>
>
> On Feb 18, 2016, at 6:07 PM, Jamo Luhrsen <[email protected]> wrote:
>
> inline...
>
> On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
>
> I’m running OVS 2.4, against stable/lithium, openflowplugin-li
>
> so this is one difference between CSIT and your setup, in addition to the whole containers vs mininet.
>
> I never scaled up to 1k, this was in the CSIT job.
> In a real scenario, I scaled to ~400. But that was even before clustering came into play in ofp lithium.
>
> I think the log I sent has log traces for openflowplugin and openflowjava; if that is not the case I could resubmit the logs.
> I removed some of them in openflowjava because it was way too chatty (logging all message content between ovs <---> odl).
>
> Unfortunately those IOExceptions happen after the whole thing blows up. I was able to narrow down some logs in openflowjava to see the first disconnected event.
> As mentioned in a previous mail (in this mail thread) it’s the device that is issuing the disconnect:
>
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | skipping bytebuf - too few bytes for header: 0 < 8
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | not enough data
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | DelegatingInboundHandler | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Channel inactive
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg on [id: 0x1efab5fb, /172.18.0.49:36983 :> /192.168.1.159:6633]
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg - DisconnectEvent
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl | 205 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP
>
> Those logs come from another run, so they are not in the logs I sent earlier. The behaviour is always the same, though.
>
> Regarding the memory, I don’t want to add more than 2G memory, because, and I tested it, the more memory I add, the more I can scale. But as you pointed out, this issue is not an OOM error. Thus I’d rather fail at 2G (fewer docker containers to spawn each run, ~50).
>
> so, maybe reduce your memory then to simplify the reproducing steps. Since you know that increasing memory allows you to scale further, but still hit the problem; let's make it easier to hit. how far can you go with the max mem set to 500M?
> if you are only loading ofp-li.
>
> I definitely need some help here, because I can’t sort myself out in the openflowplugin + openflowjava codebase…
> But I believe I already have Michal’s attention :)
>
> can I use your containers? do you have any scripts/tools to bring things up/down?
> I might be able to try and reproduce myself. I like breaking things :)
>
> JamO
>
> Thanks,
> Alexis
>
> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected]> wrote:
>
> Alexis, don't worry about filing a bug just to give us a common place to work/comment, even if we close it later because of something outside of ODL. Email is fine too.
>
> what ovs version do you have in your containers? this test sounds great.
>
> Luis is right, that if you were scaling well past 1k in the past, but now it falls over at 50 it sounds like a bug.
>
> Oh, you can try increasing the jvm max_mem from default of 2G just as a data point. The fact that you don't get OOMs makes me think memory might not be the final bottleneck.
>
> you could enable debug/trace logs in the right modules (need ofp devs to tell us that) for a little more info.
>
> I've seen those IOExceptions before and always assumed it was from an OF switch doing a hard RST on its connection.
>
> Thanks,
> JamO
>
> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>
> If the same test worked 6-8 months ago this seems like a bug, but please feel free to open it whenever you are sure.
>
> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> wrote:
>
> Hello Luis,
>
> For sure I’m willing to open a bug, but first I want to make sure there is a bug and that I’m not doing something wrong.
> In ODL’s infra, there is a test to find the maximum number of switches that can be connected to ODL, and this test reaches ~500 [0].
> I was able to scale up to 1090 switches [1] using the CSIT job in the sandbox.
> I believe the CSIT test is different in that the switches are emulated in one mininet VM, whereas I’m connecting OVS instances from separate containers.
>
> 6-8 months ago, I was able to perform the same test, and scale with OVS docker containers up to ~400 before ODL started crashing (with some optimization done behind the scenes, i.e. ulimit, mem, cpu, GC…).
> Now I’m not able to scale to more than 100 with the same configuration.
>
> FYI: I just quickly looked at the CSIT test [0] karaf.log; it seems the test is actually failing but this is not correctly advertised… switch connections are dropped.
> Look for those:
> 2016-02-18 07:07:51,741 | WARN | entLoopGroup-6-6 | OFFrameDecoder | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>     at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>     at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>     at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>     at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>     at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>
> [0]: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>
> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>
> Alexis, thanks very much for sharing this test. Would you mind opening a bug with all this info so we can track this?
>
> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>
> Hi Michal,
>
> ODL memory is capped at 2 GB; the more memory I add, the more OVS I can connect. Regarding CPU, it’s around 10-20% when connecting new OVS, with some peaks to 80%.
>
> After some investigation, here is what I observed:
> Let’s say I have 50 switches connected, stat manager disabled. I have one open socket per switch, plus an additional one for the controller.
> Then I connect a new switch (2016-02-18 09:35:08,059), 51 switches… something happens that causes all connections to be dropped (by the device?), and then ODL tries to recreate them and goes into a crazy loop where it is never able to re-establish communication, but keeps creating new sockets.
> I suspect something is being garbage collected due to lack of memory, although there are no OOM errors.
>
> Attached are the YourKit Java Profiler analysis for the described scenario and the logs [1].
>
> Thanks,
> Alexis
>
> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>
> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>
> Hi Alexis,
> I am not sure how OVS uses threads - in the changelog there are some concurrency-related improvements in 2.1.3 and 2.3.
> Also I guess docker can be forced regarding assigned resources.
>
> For you the most important thing is the number of cores used by the controller.
>
> What does your cpu and memory consumption look like when you connect all the OVSs?
>
> Regards,
> Michal
>
> ________________________________________
> From: Alexis de Talhouët <[email protected]>
> Sent: Tuesday, February 9, 2016 14:44
> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
> Cc: [email protected]
> Subject: Re: [openflowplugin-dev] Scalability issues
>
> Hello Michal,
>
> Yes, all the OvS instances I’m running have a unique DPID.
>
> Regarding the thread limit for netty, I’m running the test on a server that has 28 CPU(s).
>
> Is each OvS instance assigned its own thread?
>
> Thanks,
> Alexis
>
> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>
> Hi Alexis,
> in the Li-design the stats manager is not a standalone app but part of the core of ofPlugin. You can disable it via rpc.
>
> Just a question regarding your ovs setup. Do you have all DPIDs unique?
>
> Also there is a limit for netty in the form of the number of threads used. By default it uses 2 x cpu_cores_amount. You should have as many cores as possible in order to get max performance.
>
> Regards,
> Michal
>
> ________________________________________
> From: [email protected] <[email protected]> on behalf of Alexis de Talhouët <[email protected]>
> Sent: Tuesday, February 9, 2016 00:45
> To: [email protected]
> Subject: [openflowplugin-dev] Scalability issues
>
> Hello openflowplugin-dev,
>
> I’m currently running some scalability tests against the openflowplugin-li plugin, stable/lithium.
> Playing with the CSIT job, I was able to connect up to 1090 switches: https://git.opendaylight.org/gerrit/#/c/33213/
>
> I’m now running the test against 40 OvS switches, each one of them in a docker container.
>
> Connecting around 30 of them works fine, but then adding a new one completely breaks ODL; it goes crazy and becomes unresponsive.
> Attached is a snippet of the karaf.log with log set to DEBUG for org.opendaylight.openflowplugin, thus it’s a really big log (~2.5MB).
>
> Here is what I observed based on the log:
> I have 30 switches connected, all works fine. Then I add a new one:
> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
> - Creation of the transaction chain, …
>
> Then it all starts falling apart with this log:
>
> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>
> And then ConnectionContextImpl disconnects the switches one by one, and RpcManagerImpl is unregistered.
> Then it goes crazy for a while.
> But all I’ve done is add a new switch.
>
> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>
> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>
> And it goes on for a while.
>
> Do you have any input on the same?
>
> Could you give some advice to be able to scale?
> (I know disabling StatisticManager can help for instance)
>
> Am I doing something wrong?
>
> I can provide any requested information regarding the issue I’m facing.
>
> Thanks,
> Alexis
>
> _______________________________________________
> openflowplugin-dev mailing list
> [email protected]
> https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
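
For anyone trying to reproduce this, here is a minimal sketch of how the first disconnect could be pulled out of a large karaf.log, since the quoted runs all hinge on finding the first "disconnecting: node=..." entry. It assumes each log entry sits on a single line in the actual file (the wrapping in the mails above comes from the mail client) and that entries are in chronological order. The function names find_disconnects and summarize are hypothetical helpers, not part of ODL.

```python
import re
from collections import Counter

# Matches the ConnectionContextImpl disconnect lines quoted in this thread, e.g.
# "2016-02-18 16:56:30,018 | DEBUG | ... | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP"
# Assumption: one log entry per line in karaf.log.
DISCONNECT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} \|.*"
    r"disconnecting:\s*node=/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def find_disconnects(lines):
    """Return (timestamp_to_the_second, ip, port) for every disconnect line."""
    events = []
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            events.append(
                (match.group("ts"), match.group("ip"), int(match.group("port")))
            )
    return events


def summarize(events):
    """Return the first disconnect (file order) and a per-second disconnect count."""
    first = events[0] if events else None
    per_second = Counter(ts for ts, _, _ in events)
    return first, per_second
```

Usage would be something like `summarize(find_disconnects(open("karaf.log", errors="replace")))`; a burst of disconnects within the same second right after the first one would match the "all connections dropped at once" behaviour described above.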
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
