Hello Luis,

For sure I’m willing to open a bug, but first I want to make sure there actually
is a bug and that I’m not doing something wrong.
In ODL’s infra there is a test to find the maximum number of switches that can
be connected to ODL, and that test reaches ~500 [0].
I was able to scale up to 1090 switches [1] using the CSIT job in the sandbox.
I believe the CSIT test is different in that the switches are emulated in a
single Mininet VM, whereas I’m connecting OVS instances from separate containers.
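
For reference, this is roughly how each containerized OVS instance is pointed at
the controller. This is only a sketch of my setup: the container names, the
bridge name br0 and the controller address/port below are placeholders, not the
exact values I use.

# connect_ovs_containers.py - sketch: point each containerized OVS at ODL
# (container names, bridge name and controller address are placeholders).
import subprocess

ODL_IP = "172.31.100.1"   # placeholder controller address
OF_PORT = 6633            # default OpenFlow port ODL listens on
NUM_SWITCHES = 100

def connect(container):
    # Each container runs its own ovsdb-server/ovs-vswitchd with a bridge br0.
    subprocess.check_call([
        "docker", "exec", container,
        "ovs-vsctl", "set-controller", "br0", "tcp:%s:%d" % (ODL_IP, OF_PORT),
    ])

for i in range(1, NUM_SWITCHES + 1):
    connect("ovs%d" % i)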

6-8 months ago I was able to perform the same test and scale with OVS Docker 
containers up to ~400 switches before ODL started crashing (with some 
optimization done behind the scenes, i.e. ulimit, memory, CPU, GC…).
Now I’m not able to scale past 100 with the same configuration.
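
To be concrete about the "behind the scenes" tuning, it was roughly of this
kind; the values, ulimits and image name below are illustrative, not my exact
configuration, and the controller-side heap/GC settings are adjusted separately
in Karaf (e.g. bin/setenv):

# launch_ovs.py - sketch: start one OVS container with explicit resource limits
# (image name, limit values and ulimits are illustrative only).
import subprocess

def launch(name):
    subprocess.check_call([
        "docker", "run", "-d", "--name", name,
        "--memory", "256m",                 # cap container memory
        "--cpuset-cpus", "0-3",             # pin the container to a few cores
        "--ulimit", "nofile=65536:65536",   # raise the open-file/socket limit
        "socketplane/openvswitch",          # placeholder OVS image
    ])

launch("ovs1")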

FYI: I just had a quick look at the karaf.log of the CSIT test [0]; it seems the 
test is actually failing but this is not correctly reported… switch connections 
are dropped.
Look for entries like this:
2016-02-18 07:07:51,741 | WARN  | entLoopGroup-6-6 | OFFrameDecoder | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
        at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
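
To get a feel for how often this happens, here is a minimal sketch that counts
the relevant entries in karaf.log (the default log path and the exact message
substrings are assumptions based on the lines quoted above):

# count_drops.py - rough helper: count dropped-connection events in karaf.log
# (log path and message substrings are assumptions for this setup).
import sys
from collections import Counter

PATTERNS = [
    "Connection reset by peer",   # the OFFrameDecoder warning shown above
    "disconnecting: node=",       # ConnectionContextImpl disconnect messages
]

def count_events(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for pattern in PATTERNS:
                if pattern in line:
                    counts[pattern] += 1
    return counts

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "data/log/karaf.log"
    for pattern, count in count_events(path).items():
        print("%6d  %s" % (count, pattern))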


[0]: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
[1]: https://git.opendaylight.org/gerrit/#/c/33213/

> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
> 
> Alexis, thanks very much for sharing this test. Would you mind opening a bug 
> with all this info so we can track this?
> 
> 
>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>> 
>> Hi Michal,
>> 
>> ODL memory is capped at 2 GB; the more memory I add, the more OVS instances I 
>> can connect. Regarding CPU, it’s around 10-20% when connecting new OVS 
>> instances, with some peaks up to 80%.
>>  
>> After some investigation, here is what I observed:
>> Let’s say I have 50 switches connected, with the statistics manager disabled. 
>> I have one open socket per switch, plus an additional one for the controller.
>> Then I connect a new switch (2016-02-18 09:35:08,059), making 51 switches… 
>> something happens that causes all connections to be dropped (by the devices?), 
>> and then ODL tries to recreate them and goes into a loop where it is never 
>> able to re-establish communication but keeps creating new sockets.
>> I suspect something is being garbage collected due to lack of memory, although 
>> there are no OOM errors.
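>> 
>> (For the per-switch socket counts above I was essentially doing something like 
>> the following; a rough sketch, assuming the controller listens on 6633 and 
>> this is run with enough privileges to see the karaf process’s sockets:)
>> 
>> # count_of_sockets.py - rough count of OpenFlow sockets per remote switch
>> # (assumes the controller listens on 6633; requires the psutil package).
>> import collections
>> import psutil
>> 
>> OF_PORT = 6633
>> per_peer = collections.Counter()
>> for conn in psutil.net_connections(kind="tcp"):
>>     if conn.status != psutil.CONN_ESTABLISHED:
>>         continue
>>     if conn.laddr and conn.laddr[1] == OF_PORT and conn.raddr:
>>         per_peer[conn.raddr[0]] += 1
>> 
>> print("%d switches, %d established sockets"
>>       % (len(per_peer), sum(per_peer.values())))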
>> 
>> Attached are the YourKit Java Profiler analysis for the described scenario and 
>> the logs [1].
>> 
>> Thanks,
>> Alexis
>> 
>> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>  
>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON 
>>> TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>> 
>>> Hi Alexis,
>>> I am not sure how OVS uses threads - the changelog mentions some 
>>> concurrency-related improvements in 2.1.3 and 2.3.
>>> Also, I guess Docker can be constrained with regard to assigned resources.
>>> 
>>> For you, the most important factor is the number of cores used by the 
>>> controller.
>>> 
>>> What does your CPU and memory consumption look like when you connect all the 
>>> OVS instances?
>>> 
>>> Regards,
>>> Michal
>>> 
>>> ________________________________________
>>> From: Alexis de Talhouët <[email protected]>
>>> Sent: Tuesday, February 9, 2016 14:44
>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>> Cc: [email protected] 
>>> <mailto:[email protected]>
>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>> 
>>> Hello Michal,
>>> 
>>> Yes, all the OvS instances I’m running have unique DPIDs.
>>> 
>>> Regarding the thread limit for Netty, I’m running the test on a server that 
>>> has 28 CPUs.
>>> 
>>> Is each OvS instance assigned its own thread?
>>> 
>>> Thanks,
>>> Alexis
>>> 
>>> 
>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON 
>>>> TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>> 
>>>> Hi Alexis,
>>>> in the Li design the stats manager is not a standalone app but part of the 
>>>> core of the OF plugin. You can disable it via RPC.
>>>> 
>>>> Just a question regarding your OVS setup: are all your DPIDs unique?
>>>> 
>>>> There is also a limit for Netty in the form of the number of threads it 
>>>> uses. By default it uses 2 x the number of CPU cores. You should have as 
>>>> many cores as possible in order to get maximum performance.
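>>>> 
>>>> (A quick illustration of that default, nothing more; the system property 
>>>> named below is an assumption about the usual Netty 4 override knob and 
>>>> should be checked against your version:)
>>>> 
>>>> # netty_threads.py - back-of-the-envelope check of the default event-loop size
>>>> import os
>>>> 
>>>> cores = os.cpu_count() or 1   # e.g. 28 cores
>>>> io_threads = 2 * cores        # Netty 4 default: 2 x cores -> 56 threads here
>>>> print("expected default Netty I/O threads: %d" % io_threads)
>>>> print("override example: -Dio.netty.eventLoopThreads=%d" % io_threads)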
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Michal
>>>> 
>>>> 
>>>> 
>>>> ________________________________________
>>>> From: [email protected] 
>>>> <mailto:[email protected]> 
>>>> <[email protected] 
>>>> <mailto:[email protected]>> on behalf of 
>>>> Alexis de Talhouët <[email protected] 
>>>> <mailto:[email protected]>>
>>>> Sent: Tuesday, February 9, 2016 00:45
>>>> To: [email protected] 
>>>> <mailto:[email protected]>
>>>> Subject: [openflowplugin-dev] Scalability issues
>>>> 
>>>> Hello openflowplugin-dev,
>>>> 
>>>> I’m currently running some scalability tests against the openflowplugin Li 
>>>> plugin, stable/lithium.
>>>> Playing with the CSIT job, I was able to connect up to 1090 switches: 
>>>> https://git.opendaylight.org/gerrit/#/c/33213/
>>>> 
>>>> I’m now running the test against 40 OvS switches, each one of them in a 
>>>> Docker container.
>>>> 
>>>> Connecting around 30 of them works fine, but then adding a new one 
>>>> completely breaks ODL: it goes crazy and becomes unresponsive.
>>>> Attached is a snippet of the karaf.log with the log level set to DEBUG for 
>>>> org.opendaylight.openflowplugin, so it’s a really big log (~2.5 MB).
>>>> 
>>>> Here is what I observed based on the log:
>>>> I have 30 switches connected and all works fine. Then I add a new one:
>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>> - Creation of the transaction chain, …
>>>> 
>>>> Then everything starts falling apart with this log:
>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>> And then ConnectionContextImpl disconnects the switches one by one and 
>>>> RpcManagerImpl is unregistered.
>>>> Then it goes crazy for a while.
>>>> But all I’ve done is add a new switch…
>>>> 
>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>> And this goes on for a while.
>>>> 
>>>> Do you have any input on this?
>>>> 
>>>> Could you give some advice on how to scale further? (I know that disabling 
>>>> the statistics manager can help, for instance.)
>>>> 
>>>> Am I doing something wrong?
>>>> 
>>>> I can provide any additional information needed regarding the issue I’m facing.
>>>> 
>>>> Thanks,
>>>> Alexis
>>>> 
>>>> 
>>> 
>> 
>> _______________________________________________
>> openflowplugin-dev mailing list
>> [email protected] 
>> <mailto:[email protected]>
>> https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
> 

_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev