If the same test worked 6-8 months ago, this looks like a bug, but please feel 
free to open it whenever you are sure.

> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> 
> wrote:
> 
> Hello Luis,
> 
> I’m certainly willing to open a bug, but first I want to make sure there is a 
> bug and that I’m not doing something wrong.
> In ODL’s infra, there is a test to find the maximum number of switches that 
> can be connected to ODL, and this test reaches ~500 [0].
> I was able to scale up to 1090 switches [1] using the CSIT job in the 
> sandbox. 
> I believe the CSIT test differs in that the switches are emulated in one 
> Mininet VM, whereas I’m connecting OVS instances from separate containers.
> 
> 6-8 months ago, I was able to perform the same test and scale with OVS 
> Docker containers up to ~400 before ODL started crashing (with some 
> optimization done behind the scenes, i.e. ulimit, memory, CPU, GC…).
> Now I’m not able to scale past 100 with the same configuration.
> 
> FYI: I just took a quick look at the CSIT test’s [0] karaf.log; it seems the 
> test is actually failing but this is not correctly advertised… switch 
> connections are dropped.
> Look for entries like this:
> 016-02-18 07:07:51,741 | WARN  | entLoopGroup-6-6 | OFFrameDecoder            
>        | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 
> 0.6.4.SNAPSHOT | Unexpected exception from downstream.
> java.io.IOException: Connection reset by peer
>       at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>       at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>       at 
> sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>       at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>       at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>       at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>       at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>       at 
> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>       at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>       at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>       at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
> 
> 
> [0]: 
> https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>  
> <https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/>
> [1]: https://git.opendaylight.org/gerrit/#/c/33213/ 
> <https://git.opendaylight.org/gerrit/#/c/33213/>
> 
>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>> 
>> Alexis, thanks very much for sharing this test. Would you mind opening a bug 
>> with all this info so we can track it?
>> 
>> 
>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>>> 
>>> Hi Michal,
>>> 
>>> ODL memory is capped at 2 GB; the more memory I add, the more OVS instances 
>>> I can connect. Regarding CPU, it’s around 10-20% when connecting new OVS 
>>> instances, with some peaks at 80%.
>>>  
>>> After some investigation, here is what I observed:
>>> Let’s say I have 50 switches connected, stats manager disabled. I have one 
>>> open socket per switch, plus an additional one for the controller.
>>> Then I connect a new switch (2016-02-18 09:35:08,059), making 51… 
>>> something happens that causes all connections to be dropped (by the 
>>> devices?), and then ODL tries to recreate them and goes into a crazy loop 
>>> where it is never able to re-establish communication, but keeps creating 
>>> new sockets.
>>> I suspect something is being garbage collected due to lack of memory, 
>>> although there are no OOM errors.
>>> 
>>> Attached are the YourKit Java Profiler analysis for the described scenario 
>>> and the logs [1].
>>> 
>>> Thanks,
>>> Alexis
>>> 
>>> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>  
>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON 
>>>> TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>> 
>>>> Hi Alexis,
>>>> I am not sure how OVS uses threads - the changelog mentions some 
>>>> concurrency-related improvements in 2.1.3 and 2.3.
>>>> Also, I guess Docker can be constrained regarding assigned resources.
>>>> 
>>>> For you, the most important factor is the number of cores used by the controller.
>>>> 
>>>> What do your CPU and memory consumption look like when you connect all 
>>>> the OVSs?
>>>> 
>>>> Regards,
>>>> Michal
>>>> 
>>>> ________________________________________
>>>> From: Alexis de Talhouët <[email protected]>
>>>> Sent: Tuesday, February 9, 2016 14:44
>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>> Cc: [email protected]
>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>> 
>>>> Hello Michal,
>>>> 
>>>> Yes, all the OVS instances I’m running have unique DPIDs.
>>>> 
>>>> Regarding the thread limit for Netty, I’m running the test on a server 
>>>> that has 28 CPUs.
>>>> 
>>>> Is each OVS instance assigned its own thread?
>>>> 
>>>> Thanks,
>>>> Alexis
>>>> 
>>>> 
>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON 
>>>>> TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>> 
>>>>> Hi Alexis,
>>>>> in the Li design, the stats manager is not a standalone app but part of 
>>>>> the core of the ofPlugin. You can disable it via RPC.
>>>>> 
>>>>> Just a question regarding your OVS setup: are all your DPIDs unique?
>>>>> 
>>>>> Also, there is a limit for Netty in the form of the number of threads it 
>>>>> uses. By default it uses 2 x the number of CPU cores. You should have as 
>>>>> many cores as possible in order to get maximum performance.
>>>>> 
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Michal
>>>>> 
>>>>> 
>>>>> 
>>>>> ________________________________________
>>>>> From: [email protected] on behalf of 
>>>>> Alexis de Talhouët <[email protected]>
>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>> To: [email protected]
>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>> 
>>>>> Hello openflowplugin-dev,
>>>>> 
>>>>> I’m currently running some scalability tests against the openflowplugin-li 
>>>>> plugin, stable/lithium.
>>>>> Playing with the CSIT job, I was able to connect up to 1090 switches: 
>>>>> https://git.opendaylight.org/gerrit/#/c/33213/
>>>>> 
>>>>> I’m now running the test against 40 OVS switches, each of them in a 
>>>>> Docker container.
>>>>> 
>>>>> Connecting around 30 of them works fine, but then adding a new one 
>>>>> completely breaks ODL; it goes crazy and becomes unresponsive.
>>>>> Attached is a snippet of the karaf.log with logging set to DEBUG for 
>>>>> org.opendaylight.openflowplugin, so it’s a really big log (~2.5 MB).
>>>>> 
>>>>> Here is what I observed based on the log:
>>>>> I have 30 switches connected and all works fine. Then I add a new one:
>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>> - Creation of the transaction chain, …
>>>>> 
>>>>> Then it all starts falling apart with this log:
>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl            | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>> And then ConnectionContextImpl disconnects the switches one by one, and 
>>>>> RpcManagerImpl is unregistered.
>>>>> Then it goes crazy for a while.
>>>>> But all I’ve done is add a new switch…
>>>>> 
>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort      | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>>> And it goes on for a while.
>>>>> 
>>>>> Do you have any input on this?
>>>>> 
>>>>> Could you give some advice on how to scale? (I know disabling the 
>>>>> StatisticsManager can help, for instance.)
>>>>> 
>>>>> Am I doing something wrong?
>>>>> 
>>>>> I can provide any requested information regarding the issue I’m facing.
>>>>> 
>>>>> Thanks,
>>>>> Alexis
>>>>> 
>>>>> 
>>>> 
>>> 
>>> _______________________________________________
>>> openflowplugin-dev mailing list
>>> [email protected] 
>>> <mailto:[email protected]>
>>> https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev 
>>> <https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev>
>> 
> 

_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev