inline...

On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
> I’m running OVS 2.4, against stable/lithium, openflowplugin-li


so this is one difference between CSIT and your setup, in addition to the whole
containers vs mininet.

> I never scaled up to 1k myself; that was in the CSIT job.
> In a real scenario, I scaled to ~400, but that was even before clustering came
> into play in ofp lithium.
> 
> I think the logs I sent have trace logging for openflowplugin and openflowjava;
> if that's not the case I could resubmit the logs.
> I removed some of it in openflowjava because it was way too chatty (logging the
> content of all messages between ovs <---> odl).
> 
> Unfortunately those IOExceptions happen after the whole thing blows up. I was
> able to narrow down some logs in openflowjava to see the first disconnect
> event. As mentioned in a previous mail (in this thread), it's the device that
> is issuing the disconnect:
> 
>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder                   | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | skipping bytebuf - too few bytes for header: 0 < 8
>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector                | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | not enough data
>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | DelegatingInboundHandler         | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Channel inactive
>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl            | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg on [id: 0x1efab5fb, /172.18.0.49:36983 :> /192.168.1.159:6633]
>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl            | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg - DisconnectEvent
>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl            | 205 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP
> 
> Those log lines come from another run, so they are not in the logs I sent
> earlier, although the behaviour is always the same.
> 
> Regarding memory, I don't want to add more than 2G because, as I tested, the
> more memory I add, the further I can scale. But as you pointed out, this issue
> is not an OOM error, so I'd rather keep failing at 2G (fewer docker containers
> to spawn each run, ~50).

So, maybe reduce your memory then, to simplify the reproduction steps. Since you
know that increasing memory lets you scale further but still hit the problem,
let's make it easier to hit. How far can you go with max mem set to 500M, if you
are only loading ofp-li?
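
(Side thought: when you change the cap, it's worth double-checking what limit the
controller JVM actually ends up with, since karaf's setenv defaults (JAVA_MAX_MEM,
if I remember right) can quietly win over whatever you export in the container.
A throwaway probe like this, run with the same -Xmx, prints the effective heap
cap; the class name is made up and it's nothing ODL-specific:

    // MaxHeapProbe.java: compile and run with the same -Xmx you give karaf, e.g.
    //   javac MaxHeapProbe.java && java -Xmx500m MaxHeapProbe
    public class MaxHeapProbe {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();     // the -Xmx the JVM actually sees
            long committed = Runtime.getRuntime().totalMemory();  // heap currently committed
            System.out.printf("max heap: %d MiB, committed: %d MiB%n",
                    maxBytes / (1024 * 1024), committed / (1024 * 1024));
        }
    }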

> I definitely need some help here, because I can't find my way around the
> openflowplugin + openflowjava codebase…
> But I believe I already have Michal's attention :)

can I use your containers?  do you have any scripts/tools to bring things 
up/down?
I might be able to try and reproduce myself.  I like breaking things :)

JamO


> 
> Thanks,
> Alexis
> 
> 
>> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected] 
>> <mailto:[email protected]>> wrote:
>>
>> Alexis,  don't worry about filing a bug just to give us a common place to 
>> work/comment, even
>> if we close it later because of something outside of ODL.  Email is fine too.
>>
>> what ovs version do you have in your containers?  this test sounds great.
>>
>> Luis is right that if you were scaling well past 1k in the past but it now
>> falls over at 50, it sounds like a bug.
>>
>> Oh, you can try increasing the jvm max_mem from default of 2G just as a data 
>> point.  The
>> fact that you don't get OOMs makes me think memory might not be the final 
>> bottleneck.
>>
>> you could enable debug/trace logs in the right modules (need ofp devs to 
>> tell us that)
>> for a little more info.
>>
>> I've seen those IOExceptions before and always assumed it was from an OF
>> switch doing a hard RST on its connection.
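
(To show what I mean by a hard RST, here's a tiny self-contained toy, not plugin
code, where the client closes with SO_LINGER=0 so the kernel sends an RST instead
of a FIN, and the blocked reader gets the same "Connection reset by peer"
IOException netty logs above. The class name and port are made up.)

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Toy repro: an abortive close (SO_LINGER=0) usually surfaces on the peer's
    // read as "Connection reset by peer" instead of a clean end-of-stream.
    public class HardResetDemo {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(16633)) {
                Thread client = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try (Socket s = new Socket("127.0.0.1", 16633)) {
                            s.setSoLinger(true, 0); // close() now sends RST, not FIN
                        } catch (IOException ignored) {
                            // nothing to do in the toy
                        }
                    }
                });
                client.start();

                try (Socket accepted = server.accept()) {
                    client.join();     // wait until the client has closed (and RST'd)
                    Thread.sleep(200); // give the RST a moment to arrive
                    accepted.getInputStream().read(); // typically throws the reset here
                } catch (IOException e) {
                    System.out.println("reader saw: " + e.getMessage());
                }
            }
        }
    }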
>>
>>
>> Thanks,
>> JamO
>>
>>
>>
>> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>>> If the same test worked 6-8 months ago this seems like a bug, but please 
>>> feel free to open it whenever you are sure.
>>>
>>>> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]
>>>> <mailto:[email protected]>> wrote:
>>>>
>>>> Hello Luis,
>>>>
>>>> For sure I'm willing to open a bug, but first I want to make sure there is
>>>> actually a bug and that I'm not doing something wrong.
>>>> In ODL's infra, there is a test to find the maximum number of switches
>>>> that can be connected to ODL, and this test reaches ~500 [0].
>>>> I was able to scale up to 1090 switches [1] using the CSIT job in the 
>>>> sandbox. 
>>>> I believe the CSIT test is different in that the switches are emulated
>>>> in one mininet VM, whereas I'm connecting OVS instances from separate
>>>> containers.
>>>>
>>>> 6-8 months ago, I was able to perform the same test and scale with OVS
>>>> docker containers up to ~400 before ODL started crashing (with some
>>>> optimization done behind the scenes, i.e. ulimit, mem, cpu, GC…).
>>>> Now I'm not able to scale past 100 with the same configuration.
>>>>
>>>> FYI: I just took a quick look at the CSIT test [0] karaf.log; it seems the
>>>> test is actually failing but this is not correctly reported… switch
>>>> connections are dropped.
>>>> Look for these:
>>>> 2016-02-18 07:07:51,741 | WARN  | entLoopGroup-6-6 | OFFrameDecoder                   | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
>>>> java.io.IOException: Connection reset by peer
>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>>>> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>>>> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>>>> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>>>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>>>> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>>>> at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>>>>
>>>>
>>>> [0]: 
>>>> https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>>>> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>
>>>>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> Alexis, thanks very much for sharing this test. Would you mind opening a
>>>>> bug with all this info so we can track it?
>>>>>
>>>>>
>>>>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>>
>>>>>> Hi Michal,
>>>>>>
>>>>>> ODL memory is capped at 2GB; the more memory I add, the more OVS instances
>>>>>> I can connect. Regarding CPU, it's around 10-20% when connecting new OVS
>>>>>> instances, with some peaks to 80%.
>>>>>>
>>>>>> After some investigation, here is what I observed:
>>>>>> Let's say I have 50 switches connected, stats manager disabled. I have one
>>>>>> open socket per switch, plus an additional one for the controller.
>>>>>> Then I connect a new switch (2016-02-18 09:35:08,059), making 51 switches…
>>>>>> something happens that causes all connections to be dropped (by the
>>>>>> device?), and then ODL tries to recreate them and goes into a crazy loop
>>>>>> where it is never able to re-establish communication but keeps creating
>>>>>> new sockets.
>>>>>> I suspect something is being garbage collected due to lack of memory,
>>>>>> although there are no OOM errors.
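
(On the GC-without-OOM suspicion: before reaching for YourKit, a dumb poll of the
heap and GC MX beans, the same numbers you can read over JMX/jconsole against the
karaf JVM, tells you whether collections spike right when the disconnect storm
starts. A standalone sketch, not ODL code; the class name and 5-second interval
are arbitrary:)

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    // Periodically log heap usage plus cumulative GC count/time so a GC storm
    // shows up as a sudden jump even when no OutOfMemoryError is ever thrown.
    public class GcWatch {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
                long gcCount = 0;
                long gcTimeMs = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    gcCount += Math.max(0, gc.getCollectionCount());
                    gcTimeMs += Math.max(0, gc.getCollectionTime());
                }
                System.out.printf("heap used=%d MiB, max=%d MiB, gc count=%d, gc time=%d ms%n",
                        heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024),
                        gcCount, gcTimeMs);
                Thread.sleep(5000); // sample every 5 seconds
            }
        }
    }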
>>>>>>
>>>>>> Attached is the YourKit Java Profiler analysis for the described scenario,
>>>>>> along with the logs [1].
>>>>>>
>>>>>> Thanks,
>>>>>> Alexis
>>>>>>
>>>>>> [1]: 
>>>>>> https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>>>>
>>>>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON 
>>>>>>> TECHNOLOGIES at Cisco) <[email protected]
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>
>>>>>>> Hi Alexis,
>>>>>>> I am not sure how OVS uses threads - in the changelog there are some
>>>>>>> concurrency-related improvements in 2.1.3 and 2.3.
>>>>>>> Also I guess docker can be constrained regarding assigned resources.
>>>>>>>
>>>>>>> For you, the most important thing is the number of cores used by the controller.
>>>>>>>
>>>>>>> What do your CPU and memory consumption look like when you connect all
>>>>>>> the OVSs?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Michal
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Alexis de Talhouët <[email protected] 
>>>>>>> <mailto:[email protected]>>
>>>>>>> Sent: Tuesday, February 9, 2016 14:44
>>>>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>>>>> Cc: [email protected] 
>>>>>>> <mailto:[email protected]>
>>>>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>>>>>
>>>>>>> Hello Michal,
>>>>>>>
>>>>>>> Yes, each of the OvS instances I'm running has a unique DPID.
>>>>>>>
>>>>>>> Regarding the thread limit for netty, I'm running the test on a server
>>>>>>> that has 28 CPUs.
>>>>>>>
>>>>>>> Is each OvS instance assigned its own thread?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alexis
>>>>>>>
>>>>>>>
>>>>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON 
>>>>>>>> TECHNOLOGIES at Cisco) <[email protected]
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>
>>>>>>>> Hi Alexis,
>>>>>>>> in the Li design the stats manager is not a standalone app but part of
>>>>>>>> the core of the ofPlugin. You can disable it via RPC.
>>>>>>>>
>>>>>>>> Just a question regarding your OVS setup: are all the DPIDs unique?
>>>>>>>>
>>>>>>>> Also, there is a limit for netty in the form of the number of threads it
>>>>>>>> uses. By default it uses 2 x the number of CPU cores. You should have as
>>>>>>>> many cores as possible in order to get maximum performance.
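
(For reference, that 2 x cores figure is just how netty 4.x sizes an event loop
group when you don't give it an explicit thread count, as far as I can tell; the
sketch below only illustrates the sizing, it is not how the plugin actually wires
its pipeline, and the class name and the value 8 are made up.)

    import io.netty.channel.nio.NioEventLoopGroup;

    // With no argument (or 0), netty falls back to 2 * availableProcessors()
    // worker threads, so the cores the controller JVM sees directly bound the
    // threads available for handling switch connections.
    public class NettyThreadDefaults {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            System.out.println("cores seen by the JVM: " + cores
                    + ", default netty event loop threads: " + (2 * cores));

            NioEventLoopGroup defaults = new NioEventLoopGroup();  // 0 -> 2 * cores
            NioEventLoopGroup pinned = new NioEventLoopGroup(8);   // or size it explicitly

            defaults.shutdownGracefully();
            pinned.shutdownGracefully();
        }
    }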
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Michal
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: [email protected] 
>>>>>>>> <mailto:[email protected]>
>>>>>>>> <[email protected] 
>>>>>>>> <mailto:[email protected]>> on
>>>>>>>> behalf of Alexis de Talhouët <[email protected] 
>>>>>>>> <mailto:[email protected]>>
>>>>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>>>>> To: [email protected] 
>>>>>>>> <mailto:[email protected]>
>>>>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>>>>>
>>>>>>>> Hello openflowplugin-dev,
>>>>>>>>
>>>>>>>> I’m currently running some scalability test against openflowplugin-li 
>>>>>>>> plugin, stable/lithium.
>>>>>>>> Playing with CSIT job, I was able to connect up to 1090 switches: 
>>>>>>>> https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>
>>>>>>>> I’m now running the test against 40 OvS switches, each one of them is 
>>>>>>>> in a docker container.
>>>>>>>>
>>>>>>>> Connecting around 30 of them works fine, but then adding a new one
>>>>>>>> completely breaks ODL; it goes crazy and becomes unresponsive.
>>>>>>>> Attached is a snippet of the karaf.log with logging set to DEBUG for
>>>>>>>> org.opendaylight.openflowplugin, so it's a really big log (~2.5MB).
>>>>>>>>
>>>>>>>> Here is what I observed based on the log:
>>>>>>>> I have 30 switches connected and all works fine. Then I add a new one:
>>>>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>>>>> - Creation of the transaction chain, …
>>>>>>>>
>>>>>>>> Then it all starts falling apart with this log:
>>>>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl            | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>>>>> And then ConnectionContextImpl disconnects the switches one by one and
>>>>>>>> RpcManagerImpl is unregistered.
>>>>>>>> Then it goes crazy for a while.
>>>>>>>> But all I've done is add a new switch…
>>>>>>>>
>>>>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort      | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>>>>>> And this goes on for a while.
>>>>>>>>
>>>>>>>> Do you have any input on this?
>>>>>>>>
>>>>>>>> Could you give some advice on how to scale? (I know disabling the
>>>>>>>> StatisticsManager can help, for instance.)
>>>>>>>>
>>>>>>>> Am I doing something wrong?
>>>>>>>>
>>>>>>>> I can provide any requested information regarding the issue I'm facing.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Alexis
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
> 
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev