Thanks, this is enough for now. I raised the priority to critical so it can be fixed in the next Be SR.
BR/Luis

> On Mar 15, 2016, at 11:29 AM, Alexis de Talhouët <[email protected]> wrote:
>
> I did, and started something in int/test but haven't had the time to finish it.
>
> https://bugs.opendaylight.org/show_bug.cgi?id=5464
> https://git.opendaylight.org/gerrit/#/c/35813/
>
> I agree with the serious problem with ovs2.4, but right now I'm trying to solve an FD leak in netconf :)
>
> Thanks,
> Alexis
>
>> On Mar 15, 2016, at 2:27 PM, Luis Gomez <[email protected]> wrote:
>>
>> Alexis, did you open a bug with all the information for this? We are releasing Be SR1 and I believe we still have serious perf issues with OVS 2.4.
>>
>> BR/Luis
>>
>>> On Mar 4, 2016, at 4:56 PM, Jamo Luhrsen <[email protected]> wrote:
>>>
>>> Alexis,
>>>
>>> thanks for the bug and the patch, and keep up the good work digging at openflowplugin.
>>>
>>> JamO
>>>
>>> On 03/04/2016 07:38 AM, Alexis de Talhouët wrote:
>>>> JamO,
>>>>
>>>> Here is the bug: https://bugs.opendaylight.org/show_bug.cgi?id=5464
>>>> Here is the patch in int/test: https://git.opendaylight.org/gerrit/#/c/35813/
>>>> It is still WIP. And yes, I believe we should have a CSIT job running the test.
>>>>
>>>> Thanks,
>>>> Alexis
>>>>
>>>>> On Mar 3, 2016, at 12:41 AM, Jamo Luhrsen <[email protected]> wrote:
>>>>>
>>>>> On 02/19/2016 02:10 PM, Alexis de Talhouët wrote:
>>>>>> So far my results are:
>>>>>>
>>>>>> OVS 2.4.0: ODL configured with 2G of mem -> max is ~50 switches connected
>>>>>> OVS 2.3.1: ODL configured with 256MB of mem -> I currently have 150 switches connected, can't scale more due to infra limits.
>>>>>
>>>>> Alexis, I think this is probably worth putting a bugzilla up.
>>>>>
>>>>> How much horsepower do you need per docker ovs instance? We need to get this automated in CSIT. Marcus from ovsdb wants to do similar tests with ovsdb.
>>>>>
>>>>> JamO
>>>>>
>>>>>> I will pursue my testing next week.
>>>>>>
>>>>>> Thanks,
>>>>>> Alexis
>>>>>>
>>>>>>> On Feb 19, 2016, at 5:06 PM, Abhijit Kumbhare <[email protected]> wrote:
>>>>>>>
>>>>>>> Interesting. I wonder why that would be?
>>>>>>>
>>>>>>> On Fri, Feb 19, 2016 at 1:19 PM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>
>>>>>>> OVS 2.3.x scales fine.
>>>>>>> OVS 2.4.x doesn't scale well.
>>>>>>>
>>>>>>> Here is also the docker file for ovs 2.4.1
>>>>>>>
>>>>>>>> On Feb 19, 2016, at 11:20 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>>>>>>>
>>>>>>>> Sure, attached is a tar file containing all the scripts / config / dockerfile I'm using to set up docker containers emulating OvS.
>>>>>>>> FYI: it's ovs 2.3.0 and not 2.4.0 anymore.
>>>>>>>>
>>>>>>>> Also, forget about this whole mail thread; something in my private container must be breaking OVS behaviour, I don't know what yet.
>>>>>>>>
>>>>>>>> With the docker file attached here, I can scale to 90+ without any trouble...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Alexis
>>>>>>>>
>>>>>>>> <ovs_scalability_setup.tar.gz>
>>>>>>>>
>>>>>>>>> On Feb 18, 2016, at 6:07 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> inline...
>>>>>>>>>
>>>>>>>>> On 02/18/2016 02:58 PM, Alexis de Talhouët wrote:
>>>>>>>>>> I'm running OVS 2.4, against stable/lithium, openflowplugin-li
>>>>>>>>>
>>>>>>>>> so this is one difference between CSIT and your setup, in addition to the whole containers vs mininet.
>>>>>>>>>
>>>>>>>>>> I never scaled up to 1k, this was in the CSIT job.
>>>>>>>>>> In a real scenario, I scaled to ~400. But that was even before clustering came into play in ofp lithium.
>>>>>>>>>>
>>>>>>>>>> I think the logs I sent have trace logging for openflowplugin and openflowjava; if that's not the case I can resubmit the logs.
>>>>>>>>>> I removed some of the openflowjava logging because it was way too chatty (logging the content of every message between ovs <---> odl).
>>>>>>>>>>
>>>>>>>>>> Unfortunately those IOExceptions happen after the whole thing blows up. I was able to narrow down some logs in openflowjava to see the first disconnect event.
>>>>>>>>>> As mentioned in a previous mail (in this mail thread), it's the device that is issuing the disconnect:
>>>>>>>>>>
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFFrameDecoder | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | skipping bytebuf - too few bytes for header: 0 < 8
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | OFVersionDetector | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | not enough data
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | DelegatingInboundHandler | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Channel inactive
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg on [id: 0x1efab5fb, /172.18.0.49:36983 :> /192.168.1.159:6633]
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionAdapterImpl | 201 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | ConsumeIntern msg - DisconnectEvent
>>>>>>>>>>> 2016-02-18 16:56:30,018 | DEBUG | entLoopGroup-6-3 | ConnectionContextImpl | 205 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.18.0.49:36983|auxId=0|connection state = RIP
>>>>>>>>>>
>>>>>>>>>> Those logs come from another run, so they are not in the logs I sent earlier, although the behaviour is always the same.
>>>>>>>>>>
>>>>>>>>>> Regarding the memory, I don't want to add more than 2G because, and I tested it, the more memory I add, the more I can scale. But as you pointed out, this issue is not an OOM error. So I would rather fail at 2G (fewer docker containers to spawn each run, ~50).
>>>>>>>>>
>>>>>>>>> so, maybe reduce your memory then to simplify the reproducing steps. Since you know that increasing memory allows you to scale further but still hit the problem, let's make it easier to hit. how far can you go with the max mem set to 500M, if you are only loading ofp-li?
>>>>>>>>>
>>>>>>>>>> I definitely need some help here, because I can't sort myself out in the openflowplugin + openflowjava codebase… But I believe I already have Michal's attention :)
>>>>>>>>>
>>>>>>>>> can I use your containers? do you have any scripts/tools to bring things up/down?
>>>>>>>>> I might be able to try and reproduce myself. I like breaking things :)
>>>>>>>>>
>>>>>>>>> JamO
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Alexis
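
For reference, here is a minimal sketch of the kind of bring-up/tear-down tooling being asked about above. It is not the content of the attached tarball: it assumes a hypothetical local image named "ovs-node" that starts the OVS daemons on boot, and a placeholder controller address.

#!/usr/bin/env python
"""Sketch of an OVS-in-docker bring-up/tear-down helper.

Assumptions (not from the thread): a hypothetical "ovs-node" image that
runs ovsdb-server/ovs-vswitchd, and a reachable controller at CONTROLLER.
"""
import subprocess
import sys

CONTROLLER = "192.168.1.159:6633"  # placeholder controller address
IMAGE = "ovs-node"                 # hypothetical image name


def up(count):
    """Start `count` containers, each emulating one OpenFlow switch."""
    for i in range(count):
        name = "ovs%d" % i
        subprocess.check_call(["docker", "run", "-d", "--privileged",
                               "--name", name, IMAGE])
        # create a bridge in the container and point it at the controller
        subprocess.check_call(["docker", "exec", name,
                               "ovs-vsctl", "add-br", "br0"])
        subprocess.check_call(["docker", "exec", name,
                               "ovs-vsctl", "set-controller", "br0",
                               "tcp:" + CONTROLLER])


def down(count):
    """Tear the containers back down."""
    for i in range(count):
        subprocess.call(["docker", "rm", "-f", "ovs%d" % i])


if __name__ == "__main__":
    action, n = sys.argv[1], int(sys.argv[2])
    up(n) if action == "up" else down(n)

Run it as "python ovs_containers.py up 50" and "python ovs_containers.py down 50"; per-container resource caps (the "horsepower per docker ovs instance" question above) could be added with docker run's --memory and --cpuset-cpus flags.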
>>>>>>>>>>> On Feb 18, 2016, at 5:44 PM, Jamo Luhrsen <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Alexis, don't worry about filing a bug just to give us a common place to work/comment, even if we close it later because of something outside of ODL. Email is fine too.
>>>>>>>>>>>
>>>>>>>>>>> what ovs version do you have in your containers? this test sounds great.
>>>>>>>>>>>
>>>>>>>>>>> Luis is right that if you were scaling well past 1k in the past, but now it falls over at 50, it sounds like a bug.
>>>>>>>>>>>
>>>>>>>>>>> Oh, you can try increasing the jvm max_mem from the default of 2G just as a data point. The fact that you don't get OOMs makes me think memory might not be the final bottleneck.
>>>>>>>>>>>
>>>>>>>>>>> you could enable debug/trace logs in the right modules (need ofp devs to tell us that) for a little more info.
>>>>>>>>>>>
>>>>>>>>>>> I've seen those IOExceptions before and always assumed it was from an OF switch doing a hard RST on its connection.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> JamO
>>>>>>>>>>>
>>>>>>>>>>> On 02/18/2016 11:48 AM, Luis Gomez wrote:
>>>>>>>>>>>> If the same test worked 6-8 months ago this seems like a bug, but please feel free to open it whenever you are sure.
>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 18, 2016, at 11:45 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Luis,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For sure I'm willing to open a bug, but first I want to make sure there is a bug and that I'm not doing something wrong.
>>>>>>>>>>>>> In ODL's infra, there is a test to find the maximum number of switches that can be connected to ODL, and this test reaches ~500 [0].
>>>>>>>>>>>>> I was able to scale up to 1090 switches [1] using the CSIT job in the sandbox.
>>>>>>>>>>>>> I believe the CSIT test is different in that the switches are emulated in one mininet VM, whereas I'm connecting OVS instances from separate containers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 6-8 months ago, I was able to perform the same test and scale with OVS docker containers up to ~400 before ODL started crashing (with some optimization done behind the scenes, i.e. ulimit, mem, cpu, GC…).
>>>>>>>>>>>>> Now I'm not able to scale beyond 100 with the same configuration.
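
For contrast with the container-per-switch approach, the CSIT-style setup described above (all switches emulated inside one mininet VM, pointed at a remote controller) looks roughly like the sketch below. It assumes mininet is installed on the test VM and uses a placeholder controller IP; it is not the actual CSIT suite.

#!/usr/bin/env python
"""Rough sketch of a single-VM mininet scale test (placeholder values,
not the actual CSIT suite)."""
from mininet.cli import CLI
from mininet.net import Mininet
from mininet.node import OVSSwitch, RemoteController
from mininet.topo import LinearTopo

ODL_IP = "192.168.1.159"   # placeholder controller address
SWITCHES = 50              # number of emulated switches

# One mininet process emulates every switch and points them all at ODL.
net = Mininet(topo=LinearTopo(k=SWITCHES), switch=OVSSwitch,
              controller=None, build=False)
net.addController("c0", controller=RemoteController, ip=ODL_IP, port=6633)
net.build()
net.start()
CLI(net)      # observe / interact until "exit"
net.stop()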
>>>>>>>>>>>>>
>>>>>>>>>>>>> FYI: I just quickly looked at the CSIT test [0] karaf.log, and it seems the test is actually failing, but that is not correctly advertised… switch connections are dropped.
>>>>>>>>>>>>> Look for these:
>>>>>>>>>>>>> 2016-02-18 07:07:51,741 | WARN | entLoopGroup-6-6 | OFFrameDecoder | 181 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.6.4.SNAPSHOT | Unexpected exception from downstream.
>>>>>>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>>>>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)[:1.7.0_85]
>>>>>>>>>>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)[:1.7.0_85]
>>>>>>>>>>>>> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)[111:io.netty.buffer:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:349)[109:io.netty.transport:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>>>> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)[110:io.netty.common:4.0.26.Final]
>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [0]: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-stable-lithium/
>>>>>>>>>>>>> [1]: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 18, 2016, at 2:28 PM, Luis Gomez <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alexis, thanks very much for sharing this test. Would you mind opening a bug with all this info so we can track this?
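
Since the CSIT job can report success while connections are silently being dropped, a quick cross-check is to count the failure signatures quoted in this thread directly in karaf.log. A minimal sketch (the patterns are just the strings from the excerpts above):

#!/usr/bin/env python
"""Count disconnect signatures from this thread in a karaf.log."""
import sys
from collections import Counter

PATTERNS = [
    "java.io.IOException: Connection reset by peer",
    "ConsumeIntern msg - DisconnectEvent",
    "connection state = RIP",
]


def scan(path):
    counts = Counter()
    with open(path) as log:
        for line in log:
            for pattern in PATTERNS:
                if pattern in line:
                    counts[pattern] += 1
    return counts


if __name__ == "__main__":
    # e.g. python scan_karaf.py data/log/karaf.log
    for pattern, count in sorted(scan(sys.argv[1]).items()):
        print("%6d  %s" % (count, pattern))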
>>>>>>>>>>>>>>> On Feb 18, 2016, at 7:29 AM, Alexis de Talhouët <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Michal,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ODL's memory is capped at 2GB; the more memory I add, the more OVS I can connect. Regarding CPU, it's around 10-20% when connecting new OVS, with some peaks at 80%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After some investigation, here is what I observed:
>>>>>>>>>>>>>>> Let's say I have 50 switches connected, stats manager disabled. I have one open socket per switch, plus an additional one for the controller.
>>>>>>>>>>>>>>> Then I connect a new switch (2016-02-18 09:35:08,059), 51 switches… something happens that causes all connections to be dropped (by the device?), and then ODL tries to recreate them and goes into a crazy loop where it is never able to re-establish communication, but keeps creating new sockets.
>>>>>>>>>>>>>>> I'm suspecting something is being garbage collected due to lack of memory, although there are no OOM errors.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Attached are the YourKit Java Profiler analysis for the described scenario and the logs [1].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Alexis
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]: https://www.dropbox.com/sh/dgqeqv4j76zwbh3/AACim0za1fUozc7DlYJ4fsMJa?dl=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Feb 9, 2016, at 8:59 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>>>> I am not sure how OVS uses threads - in the changelog there are some concurrency-related improvements in 2.1.3 and 2.3.
>>>>>>>>>>>>>>>> Also, I guess docker can be constrained regarding assigned resources.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For you, the most important thing is the number of cores used by the controller.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do your cpu and memory consumption look like when you connect all the OVSs?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 14:44
>>>>>>>>>>>>>>>> To: Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco)
>>>>>>>>>>>>>>>> Cc: [email protected]
>>>>>>>>>>>>>>>> Subject: Re: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Michal,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, all the OvS instances I'm running have a unique DPID.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regarding the thread limit for netty, I'm running the test on a server that has 28 CPUs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is each OvS instance assigned its own thread?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Alexis
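
As a quick sanity check on the netty sizing rule Michal mentions below (event-loop threads default to 2 x the core count), the arithmetic for this box is trivial; sketched here in Python rather than ODL's actual Java code:

import multiprocessing

# netty's default event-loop sizing as described below: 2 x available cores
cores = multiprocessing.cpu_count()   # 28 on the server described above
netty_default_threads = 2 * cores     # 2 x 28 = 56 I/O threads
print(netty_default_threads)

With roughly 56 event-loop threads multiplexing ~50 switch connections, the netty pool itself is probably not the limiting factor at this scale, which is consistent with the suspicion elsewhere in the thread that memory is not the bottleneck either.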
>>>>>>>>>>>>>>>>> On Feb 9, 2016, at 3:42 AM, Michal Rehak -X (mirehak - PANTHEON TECHNOLOGIES at Cisco) <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Alexis,
>>>>>>>>>>>>>>>>> in the Li design the stats manager is not a standalone app but part of the core of ofPlugin. You can disable it via rpc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just a question regarding your ovs setup. Do you have all DPIDs unique?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, there is a limit for netty in the form of the number of threads it uses. By default it uses 2 x cpu_cores_amount. You should have as many cores as possible in order to get max performance.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Michal
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>> From: [email protected] on behalf of Alexis de Talhouët <[email protected]>
>>>>>>>>>>>>>>>>> Sent: Tuesday, February 9, 2016 00:45
>>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>>> Subject: [openflowplugin-dev] Scalability issues
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello openflowplugin-dev,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm currently running some scalability tests against the openflowplugin-li plugin, stable/lithium.
>>>>>>>>>>>>>>>>> Playing with the CSIT job, I was able to connect up to 1090 switches: https://git.opendaylight.org/gerrit/#/c/33213/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm now running the test against 40 OvS switches, each one of them in a docker container.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Connecting around 30 of them works fine, but then adding a new one completely breaks ODL; it goes crazy and becomes unresponsive.
>>>>>>>>>>>>>>>>> Attached is a snippet of the karaf.log with the log level set to DEBUG for org.opendaylight.openflowplugin, thus it's a really big log (~2.5MB).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here is what I observed based on the log:
>>>>>>>>>>>>>>>>> I have 30 switches connected, all works fine.
>>>>>>>>>>>>>>>>> Then I add a new one:
>>>>>>>>>>>>>>>>> - SalRoleServiceImpl starts doing its thing (2016-02-08 23:13:38,534)
>>>>>>>>>>>>>>>>> - RpcManagerImpl Registering Openflow RPCs (2016-02-08 23:13:38,546)
>>>>>>>>>>>>>>>>> - ConnectionAdapterImpl Hello received (2016-02-08 23:13:40,520)
>>>>>>>>>>>>>>>>> - Creation of the transaction chain, …
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then it all starts falling apart with this log:
>>>>>>>>>>>>>>>>>> 2016-02-08 23:13:50,021 | DEBUG | ntLoopGroup-11-9 | ConnectionContextImpl | 190 - org.opendaylight.openflowplugin.impl - 0.1.4.SNAPSHOT | disconnecting: node=/172.31.100.9:46736|auxId=0|connection state = RIP
>>>>>>>>>>>>>>>>> And then ConnectionContextImpl disconnects the switches one by one, RpcManagerImpl is unregistered, and it goes crazy for a while.
>>>>>>>>>>>>>>>>> But all I've done is add a new switch…
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Finally, at 2016-02-08 23:14:26,666, exceptions are thrown:
>>>>>>>>>>>>>>>>>> 2016-02-08 23:14:26,666 | ERROR | lt-dispatcher-85 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.2.4.SNAPSHOT | Failed to prepare transaction member-1-chn-5-txn-180 on backend
>>>>>>>>>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational#-1518836725)]] after [30000 ms]
>>>>>>>>>>>>>>>>> And this goes on for a while.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you have any input on this?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Could you give some advice on how to scale? (I know disabling the StatisticsManager can help, for instance.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am I doing something wrong?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can provide any requested information regarding the issue I'm facing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Alexis
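
On the "disable the statistics manager via RPC" point raised in this thread, the call would be made over RESTCONF roughly as sketched below. The module and RPC names are placeholders (check the statistics-control RPC actually exposed by the openflowplugin build you are running), and the script assumes the `requests` library, default admin/admin credentials, and RESTCONF on port 8181.

#!/usr/bin/env python
"""Sketch of invoking an openflowplugin statistics-control RPC over RESTCONF.

Placeholders: the <module>:<rpc-name> path, credentials, port, and any
required input body all need to match the controller you are running.
"""
import requests

ODL = "http://127.0.0.1:8181"
RPC = "/restconf/operations/<module>:<rpc-name>"   # placeholder path

resp = requests.post(ODL + RPC,
                     json={"input": {}},           # RPC input, if the RPC takes one
                     auth=("admin", "admin"))
print(resp.status_code)
print(resp.text)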
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
