Alexis, did you open a bug with all the information for this? We are releasing Be SR1 and I believe we still have serious perf issues with OVS 2.4.
BR/Luis

> On Mar 4, 2016, at 4:56 PM, Jamo Luhrsen <[email protected]> wrote:
>
> Alexis,
>
> thanks for the bug and the patch, and keep up the good work digging at
> openflowplugin.
>
> JamO
>
> On 03/04/2016 07:38 AM, Alexis de Talhouët wrote:
>> JamO,
>>
>> Here is the bug: https://bugs.opendaylight.org/show_bug.cgi?id=5464
>> Here is the patch in int/test: https://git.opendaylight.org/gerrit/#/c/35813/
>> It is still WIP. And yes, I believe we should have a CSIT job running the test.
>>
>> Thanks,
>> Alexis
>>
>>> On Mar 3, 2016, at 12:41 AM, Jamo Luhrsen <[email protected]> wrote:
>>>
>>> On 02/19/2016 02:10 PM, Alexis de Talhouët wrote:
>>>> So far my results are:
>>>>
>>>> OVS 2.4.0: ODL configured with 2G of mem -> max is ~50 switches connected
>>>> OVS 2.3.1: ODL configured with 256MB of mem -> I currently have 150
>>>> switches connected, can't scale more due to infra limits.
>>>
>>> Alexis, I think this is probably worth putting a bugzilla up.
>>>
>>> How much horsepower do you need per docker ovs instance? We need to get this
>>> automated in CSIT. Marcus from ovsdb wants to do similar tests with ovsdb.
>>>
>>> JamO
>>>
>>>> I will pursue my testing next week.
>>>>
>>>> Thanks,
>>>> Alexis
>>>>
>>>>> On Feb 19, 2016, at 5:06 PM, Abhijit Kumbhare <[email protected]> wrote:
>>>>>
>>>>> Interesting. I wonder why that would be?
>>>>>
>>>>> On Fri, Feb 19, 2016 at 1:19 PM, Alexis de Talhouët <[email protected]> wrote:
>>>>>
>>>>> OVS 2.3.x scales fine.
>>>>> OVS 2.4.x doesn't scale well.
>>>>>
>>>>> Here is also the docker file for ovs 2.4.1
>>>>>
>>>>>> On Feb 19, 2016, at 11:20 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>
>>>>>>> can I use your containers?
>>>>>>> do you have any scripts/tools to bring things up/down?
>>>>>>
>>>>>> Sure, attached is a tar file containing all the scripts / config / dockerfile
>>>>>> I'm using to set up docker containers emulating OvS.
>>>>>> FYI: it's ovs 2.3.0 and not 2.4.0 anymore.
>>>>>>
>>>>>> Also, forget about this whole mail thread; something in my private
>>>>>> container must be breaking OVS behaviour, I don't know what yet.
>>>>>>
>>>>>> With the docker file attached here, I can scale to 90+ without any trouble...
>>>>>>
>>>>>> Thanks,
>>>>>> Alexis
>>>>>>
>>>>>> <ovs_scalability_setup.tar.gz>
>>>>>>
>>>>>>> On Feb 18, 2016, at 6:07 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>
>>>>>>> inline...
>>>>>>>
>>>>>>> On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
>>>>>>>> I'm running OVS 2.4, against stable/lithium, openflowplugin-li
>>>>>>>
>>>>>>> so this is one difference between CSIT and your setup, in addition to the whole
>>>>>>> containers vs mininet.
>>>>>>>
>>>>>>>> I never scaled up to 1k, that was in the CSIT job.
>>>>>>>> In a real scenario, I scaled to ~400. But that was even before
>>>>>>>> clustering came into play in ofp lithium.
>>>>>>>>
>>>>>>>> I think the logs I sent have trace logging for openflowplugin and
>>>>>>>> openflowjava; if that's not the case I can resubmit the logs.
>>>>>>>> I removed some of it in openflowjava because it was way too chatty
>>>>>>>> (logging the content of all messages between ovs <---> odl).
>>>>>>>>
>>>>>>>> Unfortunately those IOExceptions happen after the whole thing blows
>>>>>>>> up. I was able to narrow down some logs in openflowjava
>>>>>>>> to see the first disconnect event.
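The container-per-switch setup described above can be sketched roughly as follows. This is only a sketch: the image name (`my-ovs-image`) and the controller address are placeholders, not the contents of the attached tarball, and with the default `DRY_RUN=1` the script only prints the commands it would run:

```shell
#!/bin/sh
# Sketch of spinning up N docker-based OvS instances and pointing each at
# the controller. "my-ovs-image" and the controller address are placeholders.
# DRY_RUN=1 (the default) prints commands instead of executing them.
CONTROLLER="tcp:192.168.1.159:6633"
N="${N:-3}"

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

i=1
while [ "$i" -le "$N" ]; do
  # one OVS instance per container; each bridge ends up with its own DPID
  run docker run -d --name "ovs$i" --privileged my-ovs-image
  run docker exec "ovs$i" ovs-vsctl add-br br0
  run docker exec "ovs$i" ovs-vsctl set-controller br0 "$CONTROLLER"
  i=$((i + 1))
done
```

The `ovs-vsctl add-br` / `set-controller` pair is the standard way to attach a bridge to an OpenFlow controller; everything docker-related here is an assumption about how the tarball's scripts are shaped.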
>>>>>>>> As mentioned in a previous mail (in this mail thread), it's the device that is
>>>>>>>> issuing the disconnect:
>>>>>>>>
>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | skipping bytebuf - too few bytes for header: 0 < 8
>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | not enough data
>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | DelegatingInboundHandler | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Channel inactive
>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg on [id: 0x1efab5fb, /172.18.0.49:36983 :> /192.168.1.159:6633]
>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg - DisconnectEvent
>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl | 205 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP
>>>>>>>>
>>>>>>>> Those logs come from another run, so they are not in the logs I sent
>>>>>>>> earlier, although the behaviour is always the same.
>>>>>>>>
>>>>>>>> Regarding the memory, I don't want to add more than 2G, because (and I
>>>>>>>> tested it) the more memory I add, the more I can scale.
>>>>>>>> But as you pointed out,
>>>>>>>> this issue is not an OOM error. Thus I'd rather fail at 2G (fewer
>>>>>>>> docker containers to spawn each run, ~50).
>>>>>>>
>>>>>>> so, maybe reduce your memory then, to simplify the reproducing steps.
>>>>>>> Since you know that increasing memory allows you to scale further but
>>>>>>> you still hit the problem, let's make it easier to hit. how far
>>>>>>> can you go with the max mem set to 500M, if you are only loading ofp-li?
>>>>>>>
>>>>>>>> I definitely need some help here, because I can't sort myself out in
>>>>>>>> the openflowplugin + openflowjava codebase…
>>>>>>>> But I believe I already have Michal's attention :)
>>>>>>>
>>>>>>> can I use your containers? do you have any scripts/tools to bring
>>>>>>> things up/down?
>>>>>>> I might be able to try and reproduce myself. I like breaking things :)
>>>>>>>
>>>>>>> JamO
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Alexis
>>>>>>>>
>>>>>>>>> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Alexis, don't worry about filing a bug just to give us a common
>>>>>>>>> place to work/comment, even if we close it later because of something
>>>>>>>>> outside of ODL. Email is fine too.
>>>>>>>>>
>>>>>>>>> what ovs version do you have in your containers? this test sounds great.
>>>>>>>>>
>>>>>>>>> Luis is right that if you were scaling well past 1k in the past,
>>>>>>>>> but now it falls over at 50, it sounds like a bug.
>>>>>>>>>
>>>>>>>>> Oh, you can try increasing the jvm max_mem from the default of 2G just
>>>>>>>>> as a data point. The fact that you don't get OOMs makes me think memory
>>>>>>>>> might not be the final bottleneck.
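As a data point for the max_mem suggestion above: in the ODL Karaf distribution the heap ceiling is normally raised by editing `bin/setenv` before starting the controller. A sketch only; the variable names follow the Karaf conventions and are worth double-checking against the Lithium distribution actually in use:

```shell
# bin/setenv -- sourced by bin/karaf at startup.
# Raise the max heap from the 2G used in these tests, just as a data point.
export JAVA_MAX_MEM=4096m
# Java 7 is in use here (per the stack traces), so permgen may also need room:
export JAVA_MAX_PERM_MEM=512m
```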
>>>>>>>>> you could enable debug/trace logs in the right modules (need ofp
>>>>>>>>> devs to tell us that) for a little more info.
>>>>>>>>>
>>>>>>>>> I've seen those IOExceptions before and always assumed it was from
>>>>>>>>> an OF switch doing a hard RST on its connection.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> JamO
>>>>>>>>>
>>>>>>>>> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>>>>>>>>>> If the same test worked 6-8 months ago this seems like a bug, but
>>>>>>>>>> please feel free to open it whenever you are sure.
>>>>>>>>>>
>>>>>>>>>>> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello Luis,
>>>>>>>>>>>
>>>>>>>>>>> For sure I'm willing to open a bug, but first I want to make sure
>>>>>>>>>>> there is a bug and that I'm not doing something wrong.
>>>>>>>>>>> In ODL's infra, there is a test to find the maximum number of
>>>>>>>>>>> switches that can be connected to ODL, and this test
>>>>>>>>>>> reaches ~500 [0].
>>>>>>>>>>> I was able to scale up to 1090 switches [1] using the CSIT job in
>>>>>>>>>>> the sandbox.
>>>>>>>>>>> I believe the CSIT test is different in that the switches are
>>>>>>>>>>> emulated in one mininet VM, whereas I'm connecting OVS
>>>>>>>>>>> instances from separate containers.
>>>>>>>>>>>
>>>>>>>>>>> 6-8 months ago, I was able to perform the same test and scale
>>>>>>>>>>> with OVS docker containers up to ~400 before ODL started
>>>>>>>>>>> crashing (with some optimization done behind the scenes, i.e.
>>>>>>>>>>> ulimit, mem, cpu, GC…).
>>>>>>>>>>> Now I'm not able to scale past 100 with the same configuration.
>>>>>>>>>>> FYI: I just quickly looked at the CSIT test [0] karaf.log; it seems
>>>>>>>>>>> the test is actually failing but that is not correctly
>>>>>>>>>>> advertised… switch connections are dropped.
>>>>>>>>>>> Look for these:
>>>>>>>>>>> 2016-02-18 07:07:51,741 | WARN | entLoopGroup-6-6 | OFFrameDecoder | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>>>>>>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>>>>>>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>>>>>>>>>>> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>> at
>>>>>>>>>>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>>>>>>>>>>>
>>>>>>>>>>> [0]: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>>>>>>>>>>> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>
>>>>>>>>>>>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Alexis, thanks very much for sharing this test. Would you mind
>>>>>>>>>>>> opening a bug with all this info so we can track this?
>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Michal,
>>>>>>>>>>>>>
>>>>>>>>>>>>> ODL memory is capped at 2GB; the more memory I add, the more
>>>>>>>>>>>>> OVS I can connect. Regarding CPU, it's around 10-20%
>>>>>>>>>>>>> when connecting new OVS, with some peaks to 80%.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After some investigation, here is what I observed:
>>>>>>>>>>>>> Let's say I have 50 switches connected, stats manager disabled.
>>>>>>>>>>>>> I have one open socket per switch, plus an additional
>>>>>>>>>>>>> one for the controller.
>>>>>>>>>>>>> Then I connect a new switch (2016-02-18 09:35:08,059), 51
>>>>>>>>>>>>> switches… something happens causing all connections to
>>>>>>>>>>>>> be dropped (by the device?) and then ODL
>>>>>>>>>>>>> tries to recreate them and goes into a crazy loop where it is never
>>>>>>>>>>>>> able to re-establish communication, but keeps creating new sockets.
>>>>>>>>>>>>> I suspect something is being garbage collected due to lack of
>>>>>>>>>>>>> memory, although there are no OOM errors.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Attached are the YourKit Java Profiler analysis for the described
>>>>>>>>>>>>> scenario and the logs [1].
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Alexis
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>> I am not sure how OVS uses threads - in the changelog there are
>>>>>>>>>>>>>> some concurrency-related improvements in 2.1.3 and 2.3.
>>>>>>>>>>>>>> Also I guess docker can be constrained regarding assigned resources.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For you, the most important thing is the number of cores used by
>>>>>>>>>>>>>> the controller.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do your cpu and memory consumption look like when you
>>>>>>>>>>>>>> connect all the OVSs?
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>> From: Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 14:44
>>>>>>>>>>>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>>>>>>>>>>>> Cc: [email protected]
>>>>>>>>>>>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Michal,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, all the OvS instances I'm running have unique DPIDs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding the thread limit for netty, I'm running the tests on a
>>>>>>>>>>>>>> server that has 28 CPU(s).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is each OvS instance assigned its own thread?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Alexis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>>> in the Li design the stats manager is not a standalone app but
>>>>>>>>>>>>>>> part of the core of ofPlugin. You can disable it via rpc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just a question regarding your ovs setup. Do you have all
>>>>>>>>>>>>>>> DPIDs unique?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also there is a limit for netty in the form of the number of
>>>>>>>>>>>>>>> threads used. By default it uses 2 x cpu_cores_amount.
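On disabling the stats manager via rpc, as mentioned above: against a Lithium controller this would typically go through RESTCONF. A sketch only; both the RPC path (`statistics-manager-control:change-statistics-work-mode`) and the `FULLYDISABLED` mode value are recalled from the openflowplugin model and should be verified against the running distribution before use:

```shell
# Assumption: the RPC path and mode value below need verifying against the
# statistics-manager-control model shipped with openflowplugin-li.
URL="http://localhost:8181/restconf/operations/statistics-manager-control:change-statistics-work-mode"
BODY='{"input": {"mode": "FULLYDISABLED"}}'
# Against a live controller (default credentials admin/admin):
#   curl -u admin:admin -H "Content-Type: application/json" -X POST "$URL" -d "$BODY"
echo "POST $URL"
```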
>>>>>>>>>>>>>>> You should have as many cores as possible in order to get max performance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: [email protected] on behalf of Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello openflowplugin-dev,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm currently running some scalability tests against the
>>>>>>>>>>>>>>> openflowplugin-li plugin, stable/lithium.
>>>>>>>>>>>>>>> Playing with the CSIT job, I was able to connect up to 1090
>>>>>>>>>>>>>>> switches: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm now running the test against 40 OvS switches, each one of
>>>>>>>>>>>>>>> them in a docker container.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Connecting around 30 of them works fine, but then adding a
>>>>>>>>>>>>>>> new one completely breaks ODL; it goes crazy and becomes unresponsive.
>>>>>>>>>>>>>>> Attached is a snippet of the karaf.log with the log level set to DEBUG for
>>>>>>>>>>>>>>> org.opendaylight.openflowplugin, so it's a really big log (~2.5MB).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is what I observed based on the log:
>>>>>>>>>>>>>>> I have 30 switches connected, all works fine. Then I add a new one:
>>>>>>>>>>>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>>>>>>>>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>>>>>>>>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>>>>>>>>>>>> - Creation of the transaction chain, …
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then everything starts falling apart with this log:
>>>>>>>>>>>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And then ConnectionContextImpl disconnects the switches one by one,
>>>>>>>>>>>>>>> RpcManagerImpl is unregistered, and it goes crazy for a while.
>>>>>>>>>>>>>>> But all I've done is add a new switch...
>>>>>>>>>>>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>>>>>>>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>>>>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>>>>>>>>>>>>> And this goes on for a while.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you have any input on this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could you give some advice on how to scale? (I know
>>>>>>>>>>>>>>> disabling the StatisticsManager can help, for instance.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am I doing something wrong?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can provide any requested information regarding the issue I'm facing.
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Alexis

_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
