Alexis, don't worry about filing a bug just to give us a common place to work/comment,
even if we close it later because the cause turns out to be outside of ODL. Email is fine too.

What OVS version do you have in your containers? This test sounds great.

Luis is right: if you were scaling well past 1k in the past but it now falls over at 50,
that sounds like a bug.

Oh, you can try increasing the JVM max_mem from the default of 2G, just as a data point.
The fact that you don't get OOMs makes me think memory might not be the final bottleneck.
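
If it helps, the knob I'd poke at first (assuming you're running the stock ODL Karaf
distribution; the exact file and defaults may differ in your setup) is the heap setting
in bin/setenv:

    # bin/setenv in the Karaf distro (restart karaf after editing)
    # variable name as I remember it, so double-check your distribution
    export JAVA_MAX_MEM="4096m"    # default is ~2048m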

You could enable debug/trace logs in the right modules (we need the ofp devs to tell us
which ones) for a little more info.
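
From the karaf console that would be something like the following (the logger names are
just my guess based on what shows up in your karaf.log; the ofp devs can tell us which
ones are actually worth turning up, and log:set INFO <logger> puts them back):

    log:set DEBUG org.opendaylight.openflowplugin
    log:set TRACE org.opendaylight.openflowjava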

I've seen those IOExceptions before and always assumed they came from an OF switch doing a
hard RST on its connection.


Thanks,
JamO



On 02/18/2016 11:48 AM, Luis Gomez wrote:
> If the same test worked 6-8 months ago this seems like a bug, but please feel 
> free to open it whenever you are sure.
> 
>> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected] 
>> <mailto:[email protected]>> wrote:
>>
>> Hello Luis,
>>
>> For sure I’m willing to open a bug, but first I want to make sure there actually is a
>> bug and that I’m not doing something wrong.
>> In ODL’s infra, there is a test to find the maximum number of switches that can be
>> connected to ODL, and this test reaches ~500 [0].
>> I was able to scale up to 1090 switches [1] using the CSIT job in the sandbox.
>> I believe the CSIT test is different in that the switches are emulated in one mininet
>> VM, whereas I’m connecting OVS instances from separate containers.
>>
>> 6-8 months ago, I was able to perform the same test and scale, with OVS docker
>> containers, up to ~400 before ODL started crashing (with some optimization done behind
>> the scenes, i.e. ulimit, mem, cpu, GC…).
>> Now I’m not able to scale past 100 with the same configuration.
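>>
>> (For the ulimit part I mean raising the open-file limit in the shell that launches
>> karaf, since each switch holds at least one socket; something like:
>>
>>     ulimit -n 65535
>>
>> before ./bin/karaf. The exact value I used back then is from memory.)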
>>
>> FYI: I just took a quick look at the karaf.log of the CSIT test [0]; it seems the test
>> is actually failing but that is not correctly advertised… switch connections are dropped.
>> Look for entries like this:
>> 2016-02-18 07:07:51,741 | WARN  | entLoopGroup-6-6 | OFFrameDecoder           
>>         | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 
>> 0.6.4.SNAPSHOT | Unexpected exception from downstream.
>> java.io.IOException: Connection reset by peer
>>      at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>>      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>>      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>>      at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>>      at 
>> sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>>      at 
>> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>>      at 
>> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>>      at 
>> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>>      at 
>> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>>      at 
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>>      at 
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>>      at 
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>>      at 
>> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>>      at 
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>>      at 
>> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>>      at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>>
>>
>> [0]: 
>> https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>>
>>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>
>>> Alexis, thanks very much for sharing this test. Would you mind opening a bug with all
>>> this info so we can track it?
>>>
>>>
>>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> ODL memory is capped at 2 GB; the more memory I add, the more OVS I can connect.
>>>> Regarding CPU, it’s around 10-20% when connecting new OVS, with some peaks to 80%.
>>>>  
>>>> After some investigation, here is what I observed:
>>>> Let’s say I have 50 switches connected, stats manager disabled. I have one open
>>>> socket per switch, plus an additional one for the controller.
>>>> Then I connect a new switch (2016-02-18 09:35:08,059), 51 switches… something happens
>>>> that causes all connections to be dropped (by the devices?), and then ODL tries to
>>>> recreate them and goes into a crazy loop where it is never able to re-establish
>>>> communication but keeps creating new sockets.
>>>> I suspect something is being garbage collected due to lack of memory, although there
>>>> are no OOM errors.
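>>>>
>>>> (For reference, a quick way to sanity-check that per-switch socket count on the
>>>> controller side, assuming the default OpenFlow port 6633, is something like:
>>>>
>>>>     ss -tn | grep ':6633' | wc -l
>>>>
>>>> which should also show the growing number of sockets described above.)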
>>>>
>>>> Attached are the YourKit Java Profiler analysis for the described scenario and the
>>>> logs [1].
>>>>
>>>> Thanks,
>>>> Alexis
>>>>
>>>> [1]: 
>>>> https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>>  
>>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON 
>>>>> TECHNOLOGIES at Cisco) <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> Hi Alexis,
>>>>> I am not sure how OVS uses threads - in the changelog there are some
>>>>> concurrency-related improvements in 2.1.3 and 2.3.
>>>>> Also, I guess Docker can be constrained regarding the resources it assigns.
>>>>>
>>>>> For you, the most important thing is the number of cores used by the controller.
>>>>>
>>>>> What does your CPU and memory consumption look like when you connect all the OVSs?
>>>>>
>>>>> Regards,
>>>>> Michal
>>>>>
>>>>> ________________________________________
>>>>> From: Alexis de Talhouët <[email protected] 
>>>>> <mailto:[email protected]>>
>>>>> Sent: Tuesday, February 9, 2016 14:44
>>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>>> Cc: [email protected] 
>>>>> <mailto:[email protected]>
>>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>>>
>>>>> Hello Michal,
>>>>>
>>>>> Yes, all the OvS instances I’m running have unique DPIDs.
>>>>>
>>>>> Regarding the thread limit for Netty, I’m running the test on a server that has
>>>>> 28 CPUs.
>>>>>
>>>>> Is each OvS instance assigned its own thread?
>>>>>
>>>>> Thanks,
>>>>> Alexis
>>>>>
>>>>>
>>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON 
>>>>>> TECHNOLOGIES at Cisco) <[email protected]
>>>>>> <mailto:[email protected]>> wrote:
>>>>>>
>>>>>> Hi Alexis,
>>>>>> in the Li design the stats manager is not a standalone app but part of the core of
>>>>>> the ofPlugin. You can disable it via RPC.
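>>>>>> If I remember right it is exposed over RESTCONF roughly like this (module and mode
>>>>>> names are from memory, so please double-check them against the
>>>>>> statistics-manager-control model in your distribution):
>>>>>>
>>>>>>     curl -u admin:admin -H "Content-Type: application/json" -X POST \
>>>>>>       http://localhost:8181/restconf/operations/statistics-manager-control:change-statistics-work-mode \
>>>>>>       -d '{"input": {"mode": "FULLYDISABLED"}}'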
>>>>>>
>>>>>> Just a question regarding your OVS setup: are all your DPIDs unique?
>>>>>>
>>>>>> Also, there is a limit for Netty in the form of the number of threads it uses. By
>>>>>> default it uses 2 x the number of CPU cores. You should have as many cores as
>>>>>> possible in order to get maximum performance.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Michal
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________________
>>>>>> From: [email protected] 
>>>>>> <mailto:[email protected]>
>>>>>> <[email protected] 
>>>>>> <mailto:[email protected]>> on
>>>>>> behalf of Alexis de Talhouët <[email protected] 
>>>>>> <mailto:[email protected]>>
>>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>>> To: [email protected] 
>>>>>> <mailto:[email protected]>
>>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>>>
>>>>>> Hello openflowplugin-dev,
>>>>>>
>>>>>> I’m currently running some scalability tests against the openflowplugin-li plugin,
>>>>>> stable/lithium.
>>>>>> Playing with the CSIT job, I was able to connect up to 1090 switches:
>>>>>> https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>
>>>>>> I’m now running the test against 40 OvS switches, each of them in a Docker container.
>>>>>>
>>>>>> Connecting around 30 of them works fine, but then adding a new one completely breaks
>>>>>> ODL; it goes crazy and becomes unresponsive.
>>>>>> Attached is a snippet of the karaf.log with the log level set to DEBUG for
>>>>>> org.opendaylight.openflowplugin, so it’s a really big log (~2.5 MB).
>>>>>>
>>>>>> Here is what I observed based on the log:
>>>>>> I have 30 switches connected, all works fine. Then I add a new one:
>>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>>> - Creation of the transaction chain, …
>>>>>>
>>>>>> Then it all starts falling apart with this log:
>>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | 
>>>>>>> ConnectionContextImpl            | 190 -
>>>>>>> org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting:
>>>>>>> node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>>> And then ConnectionContextImpl disconnects the switches one by one and
>>>>>> RpcManagerImpl is unregistered.
>>>>>> Then it goes crazy for a while.
>>>>>> But all I’ve done is add a new switch…
>>>>>>
>>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | 
>>>>>>> LocalThreePhaseCommitCohort      | 172 -
>>>>>>> org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT 
>>>>>>> | Failed to prepare transaction
>>>>>>> member-1-chn-5-txn-180 on backend
>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on 
>>>>>>> [ActorSelection[Anchor(akka://opendaylight-cluster-data/),
>>>>>>> Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]]
>>>>>>>  after [30000 ms]
>>>>>> And this goes on for a while.
>>>>>>
>>>>>> Do you have any input on this?
>>>>>>
>>>>>> Could you give some advice on how to scale? (I know disabling the StatisticsManager
>>>>>> can help, for instance.)
>>>>>>
>>>>>> Am I doing something wrong?
>>>>>>
>>>>>> I can provide any additional information about the issue I’m facing.
>>>>>>
>>>>>> Thanks,
>>>>>> Alexis
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 
> 
> 
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev