Hi!

We found strange behavior of OVS running on a network node in an OVN cloud.

ovs-vswitchd consumes from 500% to 3000% CPU on the network node that hosts the logical router with the load balancer for the DNS server.

At the same time, over the last 30 days the memory consumed by OVS has grown from 3.75 GiB to 4.5 GiB, and it continues to grow.
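A rough way to see what is growing inside ovs-vswitchd itself, assuming a standard build, is its own memory report, which should list counts such as handlers, revalidators, rules and udpif keys:

ovs-appctl memory/show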

One of our internal tenants provides a DNS service to other tenants, based on an Octavia load balancer with a FIP for public access.

All tenants use L3 gateway routers by default.

On the network node that holds the DNS service logical router we see high CPU usage by ovs-vswitchd.


There are a lot of WARN messages in the OVS log:

2025-11-05T11:17:39.223Z|11054|timeval(revalidator82)|WARN|Unreasonably long 6842ms poll interval (13ms user, 6697ms system)
2025-11-05T11:17:39.223Z|11055|timeval(revalidator82)|WARN|faults: 864 minor, 0 major
2025-11-05T11:17:39.223Z|11056|timeval(revalidator82)|WARN|context switches: 80 voluntary, 77 involuntary
2025-11-05T11:17:39.226Z|11193|timeval(revalidator68)|WARN|Unreasonably long 6844ms poll interval (28ms user, 6682ms system)
2025-11-05T11:17:39.226Z|11194|timeval(revalidator68)|WARN|faults: 151 minor, 0 major
2025-11-05T11:17:39.226Z|11195|timeval(revalidator68)|WARN|context switches: 101 voluntary, 74 involuntary
2025-11-05T11:17:39.227Z|15386|ofproto_dpif_upcall(revalidator66)|WARN|Spent an unreasonably long 6847ms dumping flows
2025-11-05T11:17:39.241Z|10837|ofproto_dpif_upcall(revalidator81)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:11c28bd2-53f8-4058-902b-331c3711f39d
2025-11-05T11:22:35.988Z|235172|ovs_rcu(urcu6)|WARN|blocked 1000 ms waiting for main to quiesce
2025-11-05T11:22:36.988Z|235173|ovs_rcu(urcu6)|WARN|blocked 2000 ms waiting for main to quiesce
2025-11-05T11:22:38.988Z|235174|ovs_rcu(urcu6)|WARN|blocked 4000 ms waiting for main to quiesce
2025-11-05T11:38:32.424Z|16531|poll_loop(handler36)|INFO|wakeup due to 0-ms timeout at ofproto/ofproto-dpif-upcall.c:824 (97% CPU usage)





We don't use any acceleration such as hardware offload or DPDK.

Network node has 64 CPU cores.

64 handlers and 17 revalidators are spawned by default (a note on these defaults follows the output below):

pidstat -t -p `pidof ovs-vswitchd` | grep handler | wc -l
64

sudo pidstat -t -p `pidof ovs-vswitchd` | grep reval | wc -l
17
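These counts appear to be the defaults that ovs-vswitchd derives from the number of CPU cores. For completeness, they can be overridden through Open_vSwitch other_config (a sketch only; we have left them at the defaults):

ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=32
ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=8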





The OpenFlow flow count on the network node is around 1 million (a cheaper cross-check is sketched after the output):

ovs-ofctl dump-flows br-int | wc -l

1118768
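As a quicker way to get roughly the same number without dumping every flow, an aggregate query should report the flow count directly (a sketch, assuming ovs-ofctl from the same release):

ovs-ofctl dump-aggregate br-int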



On the other network nodes OVS works fine without overload.

We found that the OVS overload is directly related to the DNS traffic generated by other tenants and destined to the public FIP associated with the load balancer VIP.

If we move the DNS tenant's router to another network node, the high CPU usage moves along with the router.

The DNS request rate is 6k qps and the overall traffic through the network node is around 500 Mb/s.



We did some research and found a few interesting things:



1. DNS requests are punted to userspace



We compared the DNS request rate (6k qps) with the dpif and xlate action rates in the coverage counters; the dpif_flow_put, dpif_flow_del and xlate_actions rates are all in the same range, which suggests that nearly every DNS request misses the datapath and results in an upcall, a flow install and a later flow delete:



ovs-appctl coverage/show | grep -e dpif -e xlate

dpif_port_del              0.0/sec     0.000/sec        0.0022/sec   total: 2026
dpif_port_add              0.0/sec     0.000/sec        0.0003/sec   total: 2322
dpif_flow_put_error       86.8/sec    45.050/sec       70.6692/sec   total: 240755186
dpif_flow_put            7246.8/sec  7206.933/sec     7174.8089/sec   total: 16366834535
dpif_flow_get              1.2/sec     0.233/sec        0.6472/sec   total: 226779
dpif_flow_flush            0.0/sec     0.000/sec        0.0000/sec   total: 1
dpif_flow_del            6579.2/sec  7090.933/sec     7103.9936/sec   total: 16124260306
dpif_execute_with_help     3.0/sec     3.483/sec        7.1372/sec   total: 205462904
dpif_execute_error         0.0/sec     0.000/sec        0.0000/sec   total: 1341
dpif_execute             7365.0/sec  7272.983/sec     7801.9311/sec   total: 29182053592
xlate_actions_too_many_output   0.0/sec     0.000/sec        0.0294/sec   total: 7499657
xlate_actions            7780.0/sec  7614.700/sec     9589.8731/sec   total: 26233874538



2. Each DNS request has a unique five-tuple, and for each DNS request the OVS kernel datapath makes an upcall to userspace; the kernel datapath flows related to DNS requests have used:never and packets:0 (a sketch for counting such flows follows the example below):



ufid:8fee8246-312b-4f3f-ad01-183c37665254, recirc_id(0xe97cf56),dp_hash(0/0),skb_priority(0/0),in_port(ens9f1v0),skb_mark(0/0),ct_state(0x21/0x25),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=client_IP,dst=server_IP,proto=17,tos=0/0,ttl=0/0,frag=no),udp(src=37618,dst=53), packets:0, bytes:0, used:never, dp:ovs, actions:ct(commit,zone=5933,mark=0x2/0x2,nat(dst=backend_IP10.254.35.179:1053)),recirc(0xffe66f6)
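The share of such never-hit flows can be estimated straight from the datapath flow dump (a sketch; the grep pattern assumes the output format shown above):

ovs-appctl dpctl/dump-flows | grep -c 'used:never'
ovs-appctl dpctl/dump-flows | wc -l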





3. Each DNS request generates several conntrack records: two on the client side and two on the server side if both the client and the server tenant routers reside on the same network node.
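This can be cross-checked against the conntrack table (a sketch; the exact tools depend on the host, and zone 5933 is simply the zone from the flow above):

sysctl net.netfilter.nf_conntrack_count
ovs-appctl dpctl/dump-conntrack zone=5933 | wc -l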



4. The OVS revalidators are overloaded and the dynamic flow limit is almost always under 100k (see the note on the flow limit after the log excerpt below):



ovs-appctl upcall/show

system@ovs-system:
  flows         : (current 82777) (avg 83380) (max 110088) (limit 159451)
  dump duration : 596ms
  ufid enabled : true

  66: (keys 5111)
  67: (keys 5230)
  68: (keys 4969)
  69: (keys 4968)
  70: (keys 4925)
  71: (keys 5055)
  72: (keys 5078)
  73: (keys 4983)
  74: (keys 4854)
  75: (keys 4904)
  76: (keys 4977)
  77: (keys 4976)
  78: (keys 4996)
  79: (keys 5020)
  80: (keys 4981)
  81: (keys 4968)
  82: (keys 5079)



2025-11-05T11:44:29.771Z|06631|ofproto_dpif_upcall(handler46)|WARN|upcall: datapath reached the dynamic limit of 53451 flows.
2025-11-05T11:46:56.047Z|10708|ofproto_dpif_upcall(revalidator67)|WARN|Number of datapath flows (62132) twice as high as current dynamic flow limit (44238).  Starting to delete flows unconditionally as an emergency measure.
2025-11-05T11:48:20.219Z|15554|ofproto_dpif_upcall(revalidator66)|WARN|Number of datapath flows (63167) twice as high as current dynamic flow limit (33668).  Starting to delete flows unconditionally as an emergency measure.
2025-11-05T11:48:27.093Z|07220|ofproto_dpif_upcall(handler39)|WARN|upcall: datapath reached the dynamic limit of 30834 flows.
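As far as we understand, the dynamic limit is derived from the configured flow limit (200000 by default, if we are not mistaken) and is lowered whenever flow dumps take too long, which is consistent with the long dump durations above. The configured ceiling can be inspected and, in principle, raised, although that would not address the underlying flow churn (a sketch, assuming other_config:flow-limit has not been changed):

ovs-vsctl get Open_vSwitch . other_config
ovs-vsctl set Open_vSwitch . other_config:flow-limit=400000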





5. The cache hit rate is under 80% and the lost counter keeps growing (see the note after the output):



system@ovs-system:
  lookups: hit:1210150277053 missed:19254866281 lost:31397443
  flows: 64275
  masks: hit:21325877085133 total:665 hit/pkt:17.35
  cache: hit:982655293987 hit-rate:79.93%
  caches:
    masks-cache: size:256
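For reference, these counters are from the datapath stats dump, e.g.:

ovs-appctl dpctl/show

If we read them correctly, the lost counter means upcalls that were dropped before they reached userspace, which fits the handlers being overloaded.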



6. We collected perf data and found that the revalidators and handlers spend most of their time in osq_lock (the perf invocation is sketched after the sample below):



    10.00%  handler1         [kernel.kallsyms]                          [k] osq_lock
     6.32%  handler48        [kernel.kallsyms]                          [k] osq_lock
     4.46%  handler46        [kernel.kallsyms]                          [k] osq_lock
     3.85%  revalidator75    [kernel.kallsyms]                          [k] osq_lock
     3.03%  handler1         [kernel.kallsyms]                          [k] masked_flow_lookup
     2.58%  revalidator72    [kernel.kallsyms]                          [k] osq_lock
     2.42%  handler17        [kernel.kallsyms]                          [k] osq_lock
     2.11%  revalidator74    [kernel.kallsyms]                          [k] osq_lock
     1.98%  sleep            [kernel.kallsyms]                          [k] masked_flow_lookup
     1.67%  handler1         [kernel.kallsyms]                          [k] mutex_spin_on_owner
     1.57%  handler49        [kernel.kallsyms]                          [k] osq_lock
     1.53%  revalidator73    [kernel.kallsyms]                          [k] osq_lock
     1.44%  revalidator77    [kernel.kallsyms]                          [k] osq_lock
     1.41%  sleep            [kernel.kallsyms]                          [k] native_safe_halt
     1.39%  handler35        [kernel.kallsyms]                          [k] osq_lock
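For reference, a profile like the one above can be collected with a system-wide perf run along these lines (a sketch; the 30-second window is illustrative):

perf record -a -g -- sleep 30
perf report --sort comm,dso,symbol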







When we simulated the same setup in the lab, OVS worked fine without overload, even at higher rates of 10k to 50k DNS requests per second and on a node with fewer CPUs than above.

We would appreciate any help in investigating what is wrong with OVS and why it is overloaded.





OVS version: 3.3.2
OVN version: 24.03.5
CMS: OpenStack Caracal
kernel version: 5.15.0-94-generic

