I did, and started something in int/test, but I haven't had the time to finish it.
https://bugs.opendaylight.org/show_bug.cgi?id=5464
https://git.opendaylight.org/gerrit/#/c/35813/

I agree with the serious problem with OVS 2.4, but right now I'm trying to solve a FD leak in netconf :)

Thanks,
Alexis

> On Mar 15, 2016, at 2:27 PM, Luis Gomez <[email protected]> wrote:
>
> Alexis, did you open a bug with all the information for this? We are
> releasing Be SR1 and I believe we still have serious perf issues with OVS 2.4.
>
> BR/Luis
>
>> On Mar 4, 2016, at 4:56 PM, Jamo Luhrsen <[email protected]> wrote:
>>
>> Alexis,
>>
>> thanks for the bug and the patch, and keep up the good work digging at
>> openflowplugin.
>>
>> JamO
>>
>> On 03/04/2016 07:38 AM, Alexis de Talhouët wrote:
>>> JamO,
>>>
>>> Here is the bug: https://bugs.opendaylight.org/show_bug.cgi?id=5464
>>> Here is the patch in int/test: https://git.opendaylight.org/gerrit/#/c/35813/
>>> It is still WIP. And yes, I believe we should have a CSIT job running the test.
>>>
>>> Thanks,
>>> Alexis
>>>
>>>> On Mar 3, 2016, at 12:41 AM, Jamo Luhrsen <[email protected]> wrote:
>>>>
>>>> On 02/19/2016 02:10 PM, Alexis de Talhouët wrote:
>>>>> So far my results are:
>>>>>
>>>>> OVS 2.4.0: ODL configured with 2G of mem -> max is ~50 switches connected
>>>>> OVS 2.3.1: ODL configured with 256MB of mem -> I currently have 150 switches
>>>>> connected, can't scale more due to infra limits.
>>>>
>>>> Alexis, I think this is probably worth putting a bugzilla up.
>>>>
>>>> How much horsepower do you need per docker ovs instance? We need to get this
>>>> automated in CSIT. Marcus from ovsdb wants to do similar tests with ovsdb.
>>>>
>>>> JamO
>>>>
>>>>> I will pursue my testing next week.
>>>>>
>>>>> Thanks,
>>>>> Alexis
>>>>>
>>>>>> On Feb 19, 2016, at 5:06 PM, Abhijit Kumbhare <[email protected]> wrote:
>>>>>>
>>>>>> Interesting. I wonder why that would be?
>>>>>>
>>>>>> On Fri, Feb 19, 2016 at 1:19 PM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>
>>>>>> OVS 2.3.x scales fine.
>>>>>> OVS 2.4.x doesn't scale well.
>>>>>>
>>>>>> Here is also the docker file for OVS 2.4.1.
>>>>>>
>>>>>>> On Feb 19, 2016, at 11:20 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>
>>>>>>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>>>>>>
>>>>>>> Sure, attached is a tar file containing all the scripts / config / Dockerfile I'm using
>>>>>>> to set up docker containers emulating OvS.
>>>>>>> FYI: it's OVS 2.3.0 and not 2.4.0 anymore.
>>>>>>>
>>>>>>> Also, forget about this whole mail thread, something in my private container must be
>>>>>>> breaking OVS behaviour, I don't know what yet.
>>>>>>>
>>>>>>> With the docker file attached here, I can scale to 90+ without any trouble...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alexis
>>>>>>>
>>>>>>> <ovs_scalability_setup.tar.gz>
>>>>>>>
>>>>>>>> On Feb 18, 2016, at 6:07 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>>
>>>>>>>> inline...
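(Side note on the container setup: the attached ovs_scalability_setup.tar.gz is not reproduced here, but a minimal loop for bringing up N docker-based OVS instances against a controller would look roughly like the sketch below. The image name, bridge name and controller address are placeholders, not the actual scripts.)

    #!/bin/sh
    # Rough sketch only: spawn N containers, each running one OVS, and point each
    # bridge at the controller. "my-ovs-image" and "br0" are placeholders.
    CONTROLLER_IP=192.168.1.159     # assumption: host where ODL listens on 6633
    NUM_SWITCHES=50

    for i in $(seq 1 "$NUM_SWITCHES"); do
        docker run -d --privileged --name "ovs$i" my-ovs-image
        # give every bridge a unique, predictable datapath-id (see the DPID question later in the thread)
        docker exec "ovs$i" ovs-vsctl add-br br0 \
            -- set bridge br0 other-config:datapath-id="$(printf '%016x' "$i")"
        docker exec "ovs$i" ovs-vsctl set-controller br0 "tcp:$CONTROLLER_IP:6633"
    done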
>>>>>>>>
>>>>>>>> On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
>>>>>>>>> I'm running OVS 2.4, against stable/lithium, openflowplugin-li
>>>>>>>>
>>>>>>>> so this is one difference between CSIT and your setup, in addition to the whole
>>>>>>>> containers vs mininet.
>>>>>>>>
>>>>>>>>> I never scaled up to 1k, this was in the CSIT job.
>>>>>>>>> In a real scenario, I scaled to ~400. But that was even before clustering came into play
>>>>>>>>> in ofp lithium.
>>>>>>>>>
>>>>>>>>> I think the logs I sent have trace logging for openflowplugin and openflowjava; if that's
>>>>>>>>> not the case I can resubmit the logs.
>>>>>>>>> I removed some of them in openflowjava because it was way too chatty (logging the content
>>>>>>>>> of all messages between ovs <---> odl).
>>>>>>>>>
>>>>>>>>> Unfortunately those IOExceptions happen after the whole thing blows up. I was able to
>>>>>>>>> narrow down some logs in openflowjava to see the first disconnect event. As mentioned in
>>>>>>>>> a previous mail (in this mail thread), it's the device that is issuing the disconnect:
>>>>>>>>>
>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | skipping bytebuf - too few bytes for header: 0 < 8
>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | not enough data
>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | DelegatingInboundHandler | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Channel inactive
>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg on [id: 0x1efab5fb, /172.18.0.49:36983 => /192.168.1.159:6633]
>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg - DisconnectEvent
>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl | 205 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP
>>>>>>>>>
>>>>>>>>> Those logs come from another run, so they are not in the logs I sent earlier, although
>>>>>>>>> the behaviour is always the same.
>>>>>>>>>
>>>>>>>>> Regarding the memory, I don't want to add more than 2G, because (and I tested it) the
>>>>>>>>> more memory I add, the further I can scale. But as you pointed out, this issue is not an
>>>>>>>>> OOM error. Thus I'd rather fail at 2G (fewer docker containers to spawn each run, ~50).
>>>>>>>>
>>>>>>>> so, maybe reduce your memory then to simplify the reproducing steps. Since you know that
>>>>>>>> increasing memory allows you to scale further but still hit the problem, let's make it
>>>>>>>> easier to hit. How far can you go with the max mem set to 500M, if you are only loading
>>>>>>>> ofp-li?
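(For reproducing with less memory and more logging, as suggested above, a hedged sketch of the knobs involved: JAVA_MAX_MEM is honoured by the Karaf start scripts in typical ODL distributions, and the logger names come from the excerpts in this thread; paths and values are assumptions, adjust to your install.)

    # Sketch only -- values and paths are assumptions.
    # 1) Cap the controller heap (e.g. ~500M as JamO suggests) before starting karaf:
    export JAVA_MAX_MEM=512m
    ./bin/karaf

    # 2) From the karaf> console, raise logging for the modules seen in the log excerpts:
    #      log:set DEBUG org.opendaylight.openflowplugin
    #      log:set TRACE org.opendaylight.openflowjava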
>>>>>>>>
>>>>>>>>> I definitely need some help here, because I can't sort myself out in the
>>>>>>>>> openflowplugin + openflowjava codebase… But I believe I already have Michal's attention :)
>>>>>>>>
>>>>>>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>>>>>>> I might be able to try and reproduce myself. I like breaking things :)
>>>>>>>>
>>>>>>>> JamO
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Alexis
>>>>>>>>>
>>>>>>>>>> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Alexis, don't worry about filing a bug just to give us a common place to work/comment,
>>>>>>>>>> even if we close it later because of something outside of ODL. Email is fine too.
>>>>>>>>>>
>>>>>>>>>> what ovs version do you have in your containers? this test sounds great.
>>>>>>>>>>
>>>>>>>>>> Luis is right that if you were scaling well past 1k in the past, but now it falls over
>>>>>>>>>> at 50, it sounds like a bug.
>>>>>>>>>>
>>>>>>>>>> Oh, you can try increasing the jvm max_mem from the default of 2G just as a data point.
>>>>>>>>>> The fact that you don't get OOMs makes me think memory might not be the final bottleneck.
>>>>>>>>>>
>>>>>>>>>> you could enable debug/trace logs in the right modules (need ofp devs to tell us that)
>>>>>>>>>> for a little more info.
>>>>>>>>>>
>>>>>>>>>> I've seen those IOExceptions before and always assumed it was from an OF switch doing a
>>>>>>>>>> hard RST on its connection.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> JamO
>>>>>>>>>>
>>>>>>>>>> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>>>>>>>>>>> If the same test worked 6-8 months ago this seems like a bug, but please feel free to
>>>>>>>>>>> open it whenever you are sure.
>>>>>>>>>>>
>>>>>>>>>>>> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Luis,
>>>>>>>>>>>>
>>>>>>>>>>>> For sure I'm willing to open a bug, but first I want to make sure there is a bug and
>>>>>>>>>>>> that I'm not doing something wrong.
>>>>>>>>>>>> In ODL's infra, there is a test to find the maximum number of switches that can be
>>>>>>>>>>>> connected to ODL, and this test reaches ~500 [0].
>>>>>>>>>>>> I was able to scale up to 1090 switches [1] using the CSIT job in the sandbox.
>>>>>>>>>>>> I believe the CSIT test is different in that the switches are emulated in one mininet
>>>>>>>>>>>> VM, whereas I'm connecting OVS instances from separate containers.
>>>>>>>>>>>>
>>>>>>>>>>>> 6-8 months ago, I was able to perform the same test and scale with OVS docker
>>>>>>>>>>>> containers up to ~400 before ODL started crashing (with some optimization done behind
>>>>>>>>>>>> the scenes, i.e. ulimit, mem, cpu, GC…).
>>>>>>>>>>>> Now I'm not able to scale beyond 100 with the same configuration.
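(A quick, hedged way to sanity-check this kind of run from the controller host: count the established OpenFlow sessions and check the file-descriptor limit, one of the "ulimit, mem, cpu, GC" knobs mentioned above. Port 6633 is the OpenFlow port seen in the logs in this thread.)

    ss -tn state established '( sport = :6633 )' | wc -l   # established switch connections (+1 for the header line)
    ulimit -n                                               # current open-files limit for this shell
    ulimit -n 65535                                         # raise it before starting karaf, if the system allows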
>>>>>>>>>>>>
>>>>>>>>>>>> FYI: I just quickly looked at the CSIT test [0] karaf.log; it seems the test is
>>>>>>>>>>>> actually failing but it is not correctly advertised… switch connections are dropped.
>>>>>>>>>>>> Look for these:
>>>>>>>>>>>>
>>>>>>>>>>>> 2016-02-18 07:07:51,741 | WARN | entLoopGroup-6-6 | OFFrameDecoder | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
>>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>>>>>>>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>>>>>>>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>>>>>>>>>>>> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>>> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>>> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>>> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>>>>>>>>>>>>
>>>>>>>>>>>> [0]: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>>>>>>>>>>>> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alexis, thanks very much for sharing this test. Would you mind opening a bug with all
>>>>>>>>>>>>> this info so we can track this?
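(A hedged, quick way to spot this in a CSIT karaf.log: grep for the two messages quoted in this thread; the counts give a rough idea of how many connections were reset or declared dead.)

    grep -c 'java.io.IOException: Connection reset by peer' karaf.log
    grep -c 'connection state = RIP' karaf.log    # disconnect events logged by ConnectionContextImpl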
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Michal,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ODL memory is capped at 2GB; the more memory I add, the more OVS I can connect.
>>>>>>>>>>>>>> Regarding CPU, it's around 10-20% when connecting new OVS, with some peaks to 80%.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After some investigation, here is what I observed:
>>>>>>>>>>>>>> Let's say I have 50 switches connected, stats manager disabled. I have one open socket
>>>>>>>>>>>>>> per switch, plus an additional one for the controller.
>>>>>>>>>>>>>> Then I connect a new switch (2016-02-18 09:35:08,059), 51 switches… something happens
>>>>>>>>>>>>>> that causes all connections to be dropped (by the device?), and then ODL tries to
>>>>>>>>>>>>>> recreate them and goes into a crazy loop where it is never able to re-establish
>>>>>>>>>>>>>> communication, but keeps creating new sockets.
>>>>>>>>>>>>>> I'm suspecting something being garbage collected due to lack of memory, although there
>>>>>>>>>>>>>> are no OOM errors.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attached are the YourKit Java Profiler analysis for the described scenario and the logs [1].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Alexis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>>> I am not sure how OVS uses threads - in the changelog there are some concurrency-related
>>>>>>>>>>>>>>> improvements in 2.1.3 and 2.3.
>>>>>>>>>>>>>>> Also I guess docker can be forced regarding assigned resources.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For you the most important thing is the number of cores used by the controller.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do your cpu and memory consumption look like when you connect all the OVSs?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 14:44
>>>>>>>>>>>>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>>>>>>>>>>>>> Cc: [email protected]
>>>>>>>>>>>>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Michal,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, all the OvS instances I'm running have a unique DPID.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regarding the thread limit for netty, I'm running the tests on a server that has 28 CPU(s).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is each OvS instance assigned its own thread?
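(Two quick, hedged checks related to the exchange above: whether every container's bridge really has a unique DPID, and how many cores / netty event-loop threads the controller actually has. The container names ovs1..ovsN and bridge br0 are assumptions; the "LoopGroup" pattern matches the truncated "entLoopGroup" thread names in the karaf logs.)

    # Any output from uniq -d means two containers share a datapath-id.
    for c in $(docker ps --format '{{.Names}}' | grep '^ovs'); do
        docker exec "$c" ovs-vsctl get bridge br0 datapath-id
    done | sort | uniq -d

    # Cores visible on the controller host, and a rough count of netty event-loop threads
    # (the default sizing discussed below is 2 x cores, so ~56 on a 28-CPU box).
    nproc
    jstack "$(pgrep -f karaf)" | grep -c 'LoopGroup'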
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Alexis
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>>>> in the Li design the stats manager is not a standalone app but part of the core of
>>>>>>>>>>>>>>>> ofPlugin. You can disable it via rpc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just a question regarding your ovs setup: do you have all DPIDs unique?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also there is a limit for netty in the form of the number of threads used. By default
>>>>>>>>>>>>>>>> it uses 2 x cpu_cores_amount. You should have as many cores as possible in order to
>>>>>>>>>>>>>>>> get max performance.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: [email protected] on behalf of Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello openflowplugin-dev,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm currently running some scalability tests against the openflowplugin-li plugin,
>>>>>>>>>>>>>>>> stable/lithium.
>>>>>>>>>>>>>>>> Playing with the CSIT job, I was able to connect up to 1090 switches:
>>>>>>>>>>>>>>>> https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm now running the test against 40 OvS switches, each one of them in a docker container.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Connecting around 30 of them works fine, but then adding a new one completely breaks
>>>>>>>>>>>>>>>> ODL; it goes crazy and becomes unresponsive.
>>>>>>>>>>>>>>>> Attached is a snippet of the karaf.log with the log level set to DEBUG for
>>>>>>>>>>>>>>>> org.opendaylight.openflowplugin, thus it's a really big log (~2.5MB).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is what I observed based on the log:
>>>>>>>>>>>>>>>> I have 30 switches connected, all works fine.
>>>>>>>>>>>>>>>> Then I add a new one:
>>>>>>>>>>>>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>>>>>>>>>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>>>>>>>>>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>>>>>>>>>>>>> - Creation of the transaction chain, …
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then it all starts falling apart with this log:
>>>>>>>>>>>>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>>>>>>>>>>>>> And then ConnectionContextImpl disconnects the switches one by one, RpcManagerImpl is
>>>>>>>>>>>>>>>> unregistered, and then it goes crazy for a while.
>>>>>>>>>>>>>>>> But all I've done is add a new switch…
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>>>>>>>>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>>>>>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>>>>>>>>>>>>>> And it goes on for a while.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you have any input on this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could you give some advice on how to scale? (I know disabling the StatisticsManager
>>>>>>>>>>>>>>>> can help, for instance.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am I doing something wrong?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can provide any requested information regarding the issue I'm facing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Alexis
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
