Interesting. I wonder why that would be?

On Fri, Feb 19, 2016 at 1:19 PM, Alexis de Talhouët <[email protected]
> wrote:

> OVS 2.3.x scales fine
> OVS 2.4.x doesn’t scale well.
>
> Here is also the docker file for ovs 2.4.1
>
>
>
> On Feb 19, 2016, at 11:20 AM, Alexis de Talhouët <[email protected]>
> wrote:
>
> can I use your containers?  do you have any scripts/tools to bring things
> up/down?
>
>
> Sure, attached is a tar file containing all the scripts / config / dockerfile
> I’m using to set up docker containers emulating OvS.
> FYI: it’s ovs 2.3.0 and not 2.4.0 anymore
>
> Also, forget about this whole mail thread, something in my private
> container must be breaking OVS behaviour, I don’t know what yet.
>
> With the docker file attached here, I can scale 90+ without any trouble...
>
> Thanks,
> Alexis
>
> <ovs_scalability_setup.tar.gz>
>
> On Feb 18, 2016, at 6:07 PM, Jamo Luhrsen <[email protected]> wrote:
>
> inline...
>
> On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
>
> I’m running OVS 2.4, against stable/lithium, openflowplugin-li
>
>
>
> so this is one difference between CSIT and your setup, in addition to the
> whole
> containers vs mininet.
>
> I never scaled up to 1k, this was in the CSIT job.
> In a real scenario, I scaled to ~400. But it was even before clustering
> came into play in ofp lithium.
>
> I think the logs I sent have trace logging for openflowplugin and
> openflowjava; if that’s not the case I can resubmit the logs.
> I removed some of them in openflowjava because it was way too chatty
> (logging all message content between ovs <---> odl)
>
> Unfortunately those IOExceptions happen after the whole thing blows up. I
> was able to narrow down some logs in openflowjava
> to see the first disconnect event. As mentioned in a previous mail (in
> this mail thread), it’s the device that is
> issuing the disconnect:
>
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder
>                   | 201 -
> org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT |
> skipping bytebuf - too few bytes for header: 0 < 8
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector
>                | 201 -
> org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT |
> not enough data
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 |
> DelegatingInboundHandler         | 201 -
> org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT |
> Channel inactive
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl
>            | 201 -
> org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT |
> ConsumeIntern msg on [id: 0x1efab5fb,
> /172.18.0.49:36983 :> /192.168.1.159:6633]
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl
>            | 201 -
> org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT |
> ConsumeIntern msg - DisconnectEvent
> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl
>            | 205 -
> org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting:
> node=/172.18.0.49:36983|auxId=0|connection
> state = RIP
>
>
> Those logs come from another run, so are not in the logs I sent earlier.
> Although the behaviour is always the same.
>
> Regarding the memory, I don’t want to add more than 2G of memory because,
> and I tested it, the more memory I add, the more I can scale. But as you
> pointed out, this issue is not an OOM error. Thus I’d rather fail at 2G
> (fewer docker containers to spawn each run, ~50).
>
>
> so, maybe reduce your memory then to simplify the reproducing steps.
> Since you know that increasing memory allows you to scale further but
> still hit the problem, let's make it easier to hit.  how far can you go
> with the max mem set to 500M, if you are only loading ofp-li?
>
> I definitely need some help here, because I can’t sort myself out in the
> openflowplugin + openflowjava codebase…
> But I believe I already have Michal’s attention :)
>
>
> can I use your containers?  do you have any scripts/tools to bring things
> up/down?
> I might be able to try and reproduce myself.  I like breaking things :)
>
> JamO
>
>
>
> Thanks,
> Alexis
>
>
> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected]> wrote:
>
> Alexis,  don't worry about filing a bug just to give us a common place to
> work/comment, even
> if we close it later because of something outside of ODL.  Email is fine
> too.
>
> what ovs version do you have in your containers?  this test sounds great.
>
> Luis is right, that if you were scaling well past 1k in the past, but now
> it falls over at
> 50 it sounds like a bug.
>
> Oh, you can try increasing the jvm max_mem from the default of 2G just as
> a data point.  The fact that you don't get OOMs makes me think memory
> might not be the final bottleneck.
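> For illustration, a minimal sketch of how that heap cap could be raised
> before starting the controller (the path and the 4G value here are
> illustrative assumptions, not values from this thread):

```shell
# Sketch: raise the Karaf JVM heap cap before starting the controller.
# KARAF_HOME and the 4G figure are illustrative assumptions.
KARAF_HOME="${KARAF_HOME:-./distribution-karaf}"
export JAVA_MAX_MEM=4096m   # picked up by bin/setenv in a stock Karaf layout
echo "would start: ${KARAF_HOME}/bin/karaf with JAVA_MAX_MEM=${JAVA_MAX_MEM}"
```

> (Lowering the same variable to 500M would give the easier-to-reproduce
> setup suggested above.)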
>
> you could enable debug/trace logs in the right modules (need ofp devs to
> tell us that)
> for a little more info.
>
> I've seen those IOExceptions before and always assumed it was from an OF
> switch doing a hard RST on its connection.
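> One way to confirm that theory would be to capture only TCP resets on the
> OpenFlow port while reproducing. A sketch (port 6633 matches the logs in
> this thread; capturing on "any" interface is an assumption):

```shell
# Sketch: build a tcpdump command that shows only TCP RST segments on the
# OpenFlow port, to see which side actually sends the reset. Run it as
# root on the controller host while adding the switch that breaks things.
OF_PORT=6633
CAPTURE="tcpdump -ni any 'tcp port ${OF_PORT} and (tcp[tcpflags] & tcp-rst != 0)'"
echo "${CAPTURE}"
```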
>
>
> Thanks,
> JamO
>
>
>
> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>
> If the same test worked 6-8 months ago this seems like a bug, but please
> feel free to open it whenever you are sure.
>
> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> wrote:
>
> Hello Luis,
>
> For sure I’m willing to open a bug, but first I want to make sure there is
> a bug and that I’m not doing something wrong.
> In ODL’s infra, there is a test to find the maximum number of switches
> that can be connected to ODL, and this test reaches ~500 [0].
> I was able to scale up to 1090 switches [1] using the CSIT job in the
> sandbox.
> I believe the CSIT test is different in that the switches are emulated in
> one mininet VM, whereas I’m connecting OVS instances from separate
> containers.
>
> 6-8 months ago, I was able to perform the same test and scale with OVS
> docker containers up to ~400 before ODL started crashing (with some
> optimization done behind the scenes, i.e. ulimit, mem, cpu, GC…).
> Now I’m not able to scale past 100 with the same configuration.
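> To make the ulimit part of that tuning concrete: each connected switch
> holds at least one socket, i.e. one file descriptor, open on the
> controller side. A sketch (65536 is an illustrative value, not the one
> used in the original test):

```shell
# Sketch: check, and try to raise, the per-process file-descriptor limit.
# With hundreds of switches each holding a socket, a low nofile limit can
# cap scale before memory does.
ulimit -n 65536 2>/dev/null || true   # may need root or pam_limits changes
echo "fd limit now: $(ulimit -n)"
```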
>
> FYI: I just took a quick look at the CSIT test [0] karaf.log; it seems the
> test is actually failing but it is not correctly advertised… switch
> connections are dropped.
> Look for those:
> 2016-02-18 07:07:51,741 | WARN  | entLoopGroup-6-6 | OFFrameDecoder
>                   | 181 -
> org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT |
> Unexpected exception from downstream.
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
> at
>
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
> at
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
> at
>
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
> at
>
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
> at
>
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
> at
> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
> at
>
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
> at
>
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
> at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>
>
> [0]:
> https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>
> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>
> Alexis, thanks very much for sharing this test. Would you mind opening a
> bug with all this info so we can track it?
>
>
> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>
> Hi Michal,
>
> ODL memory is capped at 2G; the more memory I add, the more OVS I can
> connect. Regarding CPU, it’s around 10-20% when connecting new OVS, with
> some peaks to 80%.
>
> After some investigation, here is what I observed:
> Let’s say I have 50 switches connected, stat manager disabled. I have one
> open socket per switch, plus an additional one for the controller.
> Then I connect a new switch (2016-02-18 09:35:08,059), 51 switches…
> something happens causing all connections to be dropped (by the device?)
> and then ODL tries to recreate them and goes into a crazy loop where it is
> never able to re-establish communication, but keeps creating new sockets.
> I’m suspecting something being garbage collected due to lack of memory,
> although there are no OOM errors.
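> A small sketch of how that socket churn can be watched from the controller
> host (port 6633 is an assumption based on the logs; `ss` ships with
> iproute2). Re-running it in a loop while connecting the 51st switch should
> show the count climbing:

```shell
# Sketch: count controller-side TCP sockets touching the OpenFlow port.
# A steadily growing count during the reconnect loop matches the
# "keeps creating new sockets" observation above.
OF_PORT=6633   # assumption: default OpenFlow port, as seen in the logs
COUNT=$(ss -tan 2>/dev/null | grep -c ":${OF_PORT}" || true)
echo "sockets on port ${OF_PORT}: ${COUNT}"
```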
>
> Attached are the YourKit Java Profiler analysis for the described scenario
> and the logs [1].
>
> Thanks,
> Alexis
>
> [1]:
> https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>
> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON
> TECHNOLOGIES at Cisco) <[email protected]> wrote:
>
> Hi Alexis,
> I am not sure how OVS uses threads - in the changelog there are some
> concurrency-related improvements in 2.1.3 and 2.3.
> Also, I guess docker can be constrained regarding its assigned resources.
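> To make that concrete, per-container caps can be pinned at run time so
> runs stay comparable. A sketch (image name, tag and values below are
> illustrative, not from this setup):

```shell
# Sketch: pin memory, cpu, and fd limits per OVS container. All values
# and the image name are illustrative assumptions.
LIMITS="--memory=256m --cpus=0.5 --ulimit nofile=4096:4096"
echo docker run -d ${LIMITS} --name ovs-1 ovs-scalability:2.4.1
```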
>
> For you, the most important thing is the number of cores used by the controller.
>
> What do your cpu and memory consumption look like when you connect all
> the OVSs?
>
> Regards,
> Michal
>
> ________________________________________
> From: Alexis de Talhouët <[email protected]>
> Sent: Tuesday, February 9, 2016 14:44
> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
> Cc: [email protected] <
> mailto:[email protected]
> <[email protected]>>
> Subject: Re: [openflowplugin-dev] Scalability issues
>
> Hello Michal,
>
> Yes, all the OvS instances I’m running have a unique DPID.
>
> Regarding the thread limit for netty, I’m running the test on a server
> that has 28 CPUs.
>
> Is each OvS instance assigned its own thread?
>
> Thanks,
> Alexis
>
>
> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON
> TECHNOLOGIES at Cisco) <[email protected]> wrote:
>
> Hi Alexis,
> in the Li design, the stats manager is not a standalone app but part of
> the core of ofPlugin. You can disable it via rpc.
>
> Just a question regarding your ovs setup. Do you have all DPIDs unique?
>
> Also, there is a limit for netty in the form of the number of threads it
> uses. By default it uses 2 x the number of cpu cores. You should have as
> many cores as possible in order to get max performance.
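> To illustrate that default (assuming netty's documented behaviour of twice
> the visible cores):

```shell
# Sketch: compute netty's default event-loop group size, 2 x available
# cores. On the 28-core server above this gives ~56 I/O threads shared
# by all OpenFlow connections.
CORES=$(nproc)
echo "default netty event-loop threads: $((2 * CORES))"
```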
>
>
>
> Regards,
> Michal
>
>
>
> ________________________________________
> From: [email protected] <
> mailto:[email protected]
> <[email protected]>>
> <[email protected] <
> mailto:[email protected]
> <[email protected]>>> on
> behalf of Alexis de Talhouët <[email protected] <
> mailto:[email protected] <[email protected]>>>
> Sent: Tuesday, February 9, 2016 00:45
> To: [email protected] <
> mailto:[email protected]
> <[email protected]>>
> Subject: [openflowplugin-dev] Scalability issues
>
> Hello openflowplugin-dev,
>
> I’m currently running some scalability tests against the openflowplugin-li
> plugin, stable/lithium.
> Playing with CSIT job, I was able to connect up to 1090 switches:
> https://git.opendaylight.org/gerrit/#/c/33213/
>
> I’m now running the test against 40 OvS switches, each of them in a
> docker container.
>
> Connecting around 30 of them works fine, but then adding a new one
> completely breaks ODL; it goes crazy and becomes unresponsive.
> Attached is a snippet of the karaf.log with the log level set to DEBUG for
> org.opendaylight.openflowplugin, thus it’s a really big log (~2.5MB).
>
> Here is what I observed based on the log:
> I have 30 switches connected, all works fine. Then I add a new one:
> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
> - Creation of the transaction chain, …
>
> Then it all starts falling apart with this log:
>
> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl
>            | 190 -
> org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting:
> node=/172.31.100.9:46736|auxId=0|connection state = RIP
>
> And then ConnectionContextImpl disconnects the switches one by one,
> RpcManagerImpl is unregistered, and then it goes crazy for a while.
> But all I’ve done is add a new switch…
>
> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>
> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 |
> LocalThreePhaseCommitCohort      | 172 -
> org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT |
> Failed to prepare transaction
> member-1-chn-5-txn-180 on backend
> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(
> akka://opendaylight-cluster-data/),
> Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]]
> after [30000 ms]
>
> And it goes for a while.
>
> Do you have any input on this?
>
> Could you give some advice on how to scale? (I know disabling the
> StatisticsManager can help, for instance)
>
> Am I doing something wrong?
>
> I can provide any requested information regarding the issue I’m facing.
>
> Thanks,
> Alexis
>
>
>
>
> _______________________________________________
> openflowplugin-dev mailing list
> [email protected]
> https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
>
>
>
>
>
>
>
>
>
>
>
>