On 5/2/23 19:22, Ilya Maximets wrote:
> On 5/2/23 19:04, Vladislav Odintsov via discuss wrote:
>> I ran perf record -F99 -p $(ovsdb-server) -- sleep 30 on the ovsdb-server
>> process during the CPU spike.  perf report result:
>
> Could you run it for a couple of minutes during that 5-6 minute window?
> Also, is it a single 5-6 minute poll interval or several shorter ones
> (from the log)?
>
> And you seem to be missing some debug symbols for libovsdb.
>
> One potentially quick fix for your setup would be to disable conditional
> monitoring, i.e. set ovn-monitor-all=true.  You can see the condition
> comparison functions are high in the perf output.  It doesn't scale well.
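
For reference, a rough sketch of both suggestions on the command line (the
debuginfo package name, the 120-second window and the assumption that a single
ovsdb-server process runs on the relay host are all mine, adjust as needed):

    # Install debug symbols so the libovsdb addresses resolve to names
    # (package names differ per distro).
    dnf debuginfo-install openvswitch

    # Profile the relay with call graphs for a couple of minutes while the
    # spike is ongoing.
    perf record -F 99 -g -p $(pidof ovsdb-server) -- sleep 120
    perf report --no-children

    # Disable conditional monitoring on each chassis; ovn-controller reads
    # this from the local Open_vSwitch table at runtime.
    ovs-vsctl set Open_vSwitch . external_ids:ovn-monitor-all=true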
FWIW, conditional monitoring inefficiencies are not all related to the
database server.  There are significant inefficiencies in the way
ovn-controller creates condition clauses:
  https://bugzilla.redhat.com/show_bug.cgi?id=2139194

Since you're removing a port in your setup, that likely triggers condition
change requests from all controllers for which this port is local.  Assuming
you have N ports total in the cluster and M ports per node, conditions will
contain M * 2 clauses and the server will have to perform N * M * 2
comparisons per controller on a condition change request.  As a made-up
illustration: with N = 10,000 ports cluster-wide and M = 500 ports local to a
chassis, each condition carries 1,000 clauses, so a single condition change
costs on the order of 10,000,000 comparisons for that one controller alone.

> Best regards, Ilya Maximets.
>
>> Samples: 2K of event 'cpu-clock', Event count (approx.): 29989898690
>> Overhead  Command       Shared Object                Symbol
>>  58.71%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] uuid_compare_3way
>>   7.61%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011058
>>   6.60%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_atom_compare_3way
>>   5.93%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_condition_match_any_clause
>>   2.26%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011070
>>   2.19%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] uuid_compare_3way@plt
>>   2.16%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010ffd
>>   1.68%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe8
>>   1.65%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110ec
>>   1.38%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011084
>>   1.31%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001100e
>>   1.25%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011063
>>   1.08%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_compare_3way
>>   1.04%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe0
>>   0.67%  ovsdb-server  libc-2.17.so                 [.] _int_malloc
>>   0.54%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001105a
>>   0.51%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_monitor_get_update
>>   0.47%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_hash
>>   0.30%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] json_destroy__
>>   0.24%  ovsdb-server  libc-2.17.so                 [.] malloc_consolidate
>>   0.24%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_atom_hash
>>   0.20%  ovsdb-server  [kernel.kallsyms]            [k] __do_softirq
>>   0.17%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] json_string
>>   0.13%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_destroy
>>   0.13%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe5
>>   0.10%  ovsdb-server  libc-2.17.so                 [.] __memset_sse2
>>   0.10%  ovsdb-server  libc-2.17.so                 [.] __strlen_sse2_pminub
>>   0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fea
>>   0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011020
>>   0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011067
>>   0.07%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_compare_3way@plt
>>   0.07%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_equals
>>   0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010ff1
>>   0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011025
>>   0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001108b
>>   0.03%  ovsdb-server  [kernel.kallsyms]            [k] __netif_schedule
>>   0.03%  ovsdb-server  libc-2.17.so                 [.] __strncmp_sse42
>>   0.03%  ovsdb-server  libc-2.17.so                 [.] _int_free
>>   0.03%  ovsdb-server  libc-2.17.so                 [.] free
>>   0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_clone
>>   0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_from_json
>>   0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] shash_find
>>   0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] 0x000000000018f7e2
>>   0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] 0x000000000018f7e8
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_condition_from_json
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe1
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fff
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001105e
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110c3
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110f2
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c007
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c030
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c05b
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c45f
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c535
>>   0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c644
>>
>>> On 2 May 2023, at 19:45, Ilya Maximets via discuss
>>> <[email protected]> wrote:
>>>
>>> On 5/2/23 16:45, Vladislav Odintsov wrote:
>>>> Hi Ilya,
>>>>
>>>> let me jump into this thread.
>>>>
>>>> Right now I'm debugging the behaviour of OVN (22.09.x) and ovsdb-server
>>>> 3.1.0 where a single ovsdb update3 request makes ovsdb-server, which acts
>>>> as a relay for the OVN Southbound DB with only 5-6 clients connected to it
>>>> (ovn-controllers acting as central chassis for external access, with
>>>> ha_group enabled for edge LRs), utilize 100% CPU for 5-6 minutes.
>>>> During this time the ovsdb relay fails to answer ovsdb inactivity probes,
>>>> and then clients and even the upstream ovsdb-servers disconnect this relay
>>>> because of the 60 s probe timeout.  All the probe intervals are configured
>>>> to 60 seconds (ovsdb-server SB cluster <-> ovsdb SB relay <->
>>>> ovn-controller).  Earlier I've posted a long read with some of the
>>>> problems listed [1].
>>>>
>>>> IIUC, this update is generated by ovn-northd after one LS with only one
>>>> LSP of type router and an attached LB is removed.
>>>> You can see the request JSON here: [2]
>>>> Such updates appear not only when an LS/LB is removed but also on some
>>>> other operations; this is just an example.
>>>> So it seems like ovn-northd re-creates a big dp group, and such an update
>>>> is for some reason difficult for the ovsdb relay to handle (actually the
>>>> ovn-controllers also utilize 100% CPU).
>>>>
>>>> Have you seen such behaviour?  Maybe you've got a suggestion about the
>>>> reason and a possible fix for such a huge load from one update3?
>>>
>>> Such big updates are typically triggered by the fact that
>>> northd doesn't modify datapath groups.  It re-creates them
>>> instead.  I've actually been working on a patch for this since
>>> last week and hope to post it before the soft freeze for 23.06.
>>>
>>> In your case, however, it seems like you have a new datapath
>>> group created for a load balancer.  And it looks like you have
>>> about 1000 switches in the setup, which is a lot, but...
>>>
>>> Looking at the dump [2], it's big for a human, but it's only
>>> about 100 KB in size.  The database shouldn't have a lot of trouble
>>> processing that, especially in 3.1.  So, I'm not sure what
>>> exactly is going on in your case.  If you can profile the relay
>>> server and see what it is doing all that time, that would be
>>> helpful.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>>> Thanks.
>>>>
>>>> 1: https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403699.html
>>>> 2: https://gist.github.com/odivlad/bba4443e589a268a0f389c2972511df3
>>>>
>>>>> On 2 May 2023, at 14:49, Ilya Maximets via discuss
>>>>> <[email protected]> wrote:
>>>>>
>>>>> From my side, the first option would be increasing the inactivity
>>>>> probe on the ovn-controller side and seeing if that resolves the issue.
>>>>> Deployments typically have 60+ seconds set, just in case.
>>>>>
>>>>> Also, if you're not already using the latest versions of OVS/OVN, an
>>>>> upgrade may resolve the issue as well.  For example, OVS 2.17 provides a
>>>>> big performance improvement over previous versions, and 3.0 and 3.1 give
>>>>> even more on top.  And with new OVN releases, the southbound database
>>>>> size usually goes down significantly, reducing the load on the OVSDB
>>>>> server.  I'd suggest using releases after OVN 22.09 for large-scale
>>>>> deployments.
>>>>>
>>>>> However, if your setup has only one switch with 250 ports and you
>>>>> have an issue, that should not really be related to scale, and you
>>>>> need to investigate further what exactly is happening.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>> On 5/2/23 08:58, Felix Hüttner via discuss wrote:
>>>>>> Hi Gavin,
>>>>>>
>>>>>> we saw similar issues after reaching a certain number of hypervisors.
>>>>>> This happened because our ovsdb processes ran at 100% CPU utilization
>>>>>> (and they are not multithreaded).
>>>>>>
>>>>>> Our solutions were:
>>>>>>
>>>>>> 1. If you use SSL on your north-/southbound DB, disable it and add a
>>>>>>    TLS-terminating reverse proxy (like traefik) in front.
>>>>>> 2. Increase the inactivity probe significantly (you might need to change
>>>>>>    it on the ovn-controller and ovsdb side, not sure anymore).
>>>>>> 3. Introduce ovsdb relays and connect the ovn-controllers there.
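
For anyone following along, items 2 and 3 above boil down to roughly the
following (the 60000 ms value, the addresses and the ports are placeholders
in this sketch, not recommendations):

    # ovn-controller side: probe interval towards the SB DB, in milliseconds.
    ovs-vsctl set Open_vSwitch . external_ids:ovn-remote-probe-interval=60000

    # Server side: probe interval on the SB database Connection record(s).
    ovn-sbctl set connection . inactivity_probe=60000

    # A standalone OVSDB relay for the southbound DB ...
    ovsdb-server --remote=ptcp:6642:0.0.0.0 relay:OVN_Southbound:tcp:<sb-cluster-ip>:6642

    # ... and point the local ovn-controller at the relay instead of the cluster.
    ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=tcp:<relay-ip>:6642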
>>>>>>
>>>>>> --
>>>>>> Felix Huettner
>>>>>>
>>>>>> From: discuss <[email protected]> On Behalf Of Gavin McKee via discuss
>>>>>> Sent: Monday, May 1, 2023 9:20 PM
>>>>>> To: ovs-discuss <[email protected]>
>>>>>> Subject: [ovs-discuss] CPU pinned at 100% , ovn-controller to ovnsb_db unstable
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm having a pretty bad issue with ovn-controller on the hypervisors
>>>>>> being unable to connect to the OVN SB DB:
>>>>>>
>>>>>> 2023-05-01T19:13:33.969Z|00541|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-05-01T19:13:33.969Z|00542|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>>>> 2023-05-01T19:13:43.043Z|00543|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>>>> 2023-05-01T19:13:56.115Z|00544|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-05-01T19:13:56.115Z|00545|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>>>> 2023-05-01T19:14:36.177Z|00546|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>>>> 2023-05-01T19:14:44.996Z|00547|jsonrpc|WARN|tcp:10.193.1.2:6642: receive error: Connection reset by peer
>>>>>> 2023-05-01T19:14:44.996Z|00548|reconnect|WARN|tcp:10.193.1.2:6642: connection dropped (Connection reset by peer)
>>>>>> 2023-05-01T19:15:44.131Z|00549|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>>>> 2023-05-01T19:15:54.137Z|00550|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-05-01T19:15:54.137Z|00551|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>>>> 2023-05-01T19:16:02.184Z|00552|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>>>> 2023-05-01T19:16:14.488Z|00553|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>>>> 2023-05-01T19:16:14.488Z|00554|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>>>>
>>>>>> This happened after pushing a configuration to the northbound DB for
>>>>>> around 250 logical switch ports.
>>>>>>
>>>>>> Once I turn on the VMs, everything goes bad very quickly:
>>>>>>
>>>>>> 2023-05-01T04:27:09.294Z|01947|poll_loop|INFO|wakeup due to [POLLOUT] on fd 66 (10.193.200.6:6642<->10.193.0.102:48794) at ../lib/stream-fd.c:153 (100% CPU usage)
>>>>>>
>>>>>> Can anyone provide any guidance on how to run down an issue like this?
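
In case it helps with running this down: something along these lines can sit
on the DB host and grab a profile only while the CPU is actually pegged (the
90% threshold, the 60-second window and the assumption of a single
ovsdb-server process on the host are all arbitrary in this sketch):

    while sleep 1; do
        pid=$(pidof ovsdb-server)         # assumes one ovsdb-server on this host
        cpu=$(ps -o %cpu= -p "$pid" | tr -d ' ' | cut -d. -f1)
        if [ "${cpu:-0}" -ge 90 ]; then
            # Capture call graphs for a minute during the spike, then stop.
            perf record -F 99 -g -p "$pid" -- sleep 60
            break
        fi
    done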
>>>>
>>>> Regards,
>>>> Vladislav Odintsov
>>
>> Regards,
>> Vladislav Odintsov
