Hello OVS folks,

We are seeing a strange behavior with ovs-vswitchd when we upgraded
ovn-controller to 24.09 from 23.09.

After ovn-controller is upgraded,  ovs-vswitchd is blocked for a long
time and because  of which ovn-controller can't program the flows.
The  compute node was hosting just a few VMs.  Once the VMs were
migrated to other node, ovs-vswitchd and ovn-controller are back to
normal.

Below are some of the log messages seen in ovs-vswitchd.log.

-------------------------------------------------

2025-06-18T05:32:17.770Z|434224|timeval|WARN|Unreasonably long 59741ms
poll interval (3942ms user, 55611ms system)
2025-06-18T05:32:17.770Z|434225|timeval|WARN|faults: 29 minor, 0 major
2025-06-18T05:32:17.770Z|434226|timeval|WARN|context switches: 0
voluntary, 95 involuntary
2025-06-18T05:32:17.770Z|434227|coverage|INFO|Skipping details of
duplicate event coverage for hash=81106859
2025-06-18T05:32:17.770Z|434228|poll_loop|INFO|wakeup due to [POLLIN]
on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:274 (99% CPU usage)
2025-06-18T05:32:17.770Z|434229|poll_loop|INFO|wakeup due to [POLLIN]
on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:157 (99% CPU usage)
2025-06-18T05:32:17.770Z|434230|poll_loop|INFO|wakeup due to
[POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:153 (99% CPU usage)
2025-06-18T05:32:17.770Z|434231|poll_loop|INFO|wakeup due to [POLLIN]
on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:157 (99% CPU usage)
2025-06-18T05:32:17.774Z|11877|ofproto_dpif_upcall(handler28)|INFO|Dropped
45 log messages in last 60 seconds (most recently, 60 seconds ago) due
to excessive rate
2025-06-18T05:32:17.774Z|07880|ofproto_dpif_upcall(revalidator40)|WARN|Dropped
36 log messages in last 60 seconds (most recently, 60 seconds ago) due
to excessive rate
2025-06-18T05:32:17.774Z|11878|ofproto_dpif_upcall(handler28)|INFO|received
packet on unassociated datapath port 7 (no OpenFlow port for datapath
port 7)
2025-06-18T05:32:17.774Z|07881|ofproto_dpif_upcall(revalidator40)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:d1e6759c-66bb-46ae-be16-2f6cbbbc2839
2025-06-18T05:32:18.275Z|08136|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:cf11e102-9fc6-42b3-aadd-6e5a1a3603a8
2025-06-18T05:32:18.275Z|08137|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:2e069616-631a-491c-ad14-c0460b3ccad2
2025-06-18T05:32:18.275Z|08138|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:ef4f40b5-625e-47ee-9d26-09b816b497bf
2025-06-18T05:32:18.275Z|08139|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:f60d9bbc-4880-41c4-b799-e501b5d9f0a7
2025-06-18T05:32:18.275Z|08140|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:aec47146-a9ec-42fa-97a9-6896017c865a
2025-06-18T05:32:18.275Z|08141|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:e63cac09-4159-4187-8dff-64b432019bb3
2025-06-18T05:32:18.275Z|08142|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:f6984063-81ad-432b-96d9-22dadd9f8edb
2025-06-18T05:32:18.275Z|08143|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:609d6c3f-7093-4dba-bdb9-5607add1fb8a
2025-06-18T05:32:18.275Z|08144|ofproto_dpif_upcall(revalidator46)|WARN|Failed
to acquire udpif_key corresponding to unexpected flow (Invalid
argument): ufid:81887285-1a37-47d5-959a-85fa6a3547cc
2025-06-18T05:32:18.780Z|466262|ovs_rcu(urcu4)|WARN|blocked 1000 ms
waiting for main to quiesce
2025-06-18T05:32:19.779Z|466263|ovs_rcu(urcu4)|WARN|blocked 2000 ms
waiting for main to quiesce
2025-06-18T05:32:21.779Z|466264|ovs_rcu(urcu4)|WARN|blocked 4000 ms
waiting for main to quiesce
2025-06-18T05:32:25.780Z|466265|ovs_rcu(urcu4)|WARN|blocked 8001 ms
waiting for main to quiesce
2025-06-18T05:32:26.092Z|353399|ofproto_dpif_xlate(handler17)|WARN|Dropped
3 log messages in last 105 seconds (most recently, 86 seconds ago) due
to excessive rate
2025-06-18T05:32:26.092Z|353400|ofproto_dpif_xlate(handler17)|WARN|over
4096 resubmit actions on bridge br-int while processing
arp,in_port=4,dl_vlan=120,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:01:04,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=100.100.100.1,arp_tpa=100.100.100.1,arp_op=1,arp_sha=00:00:5e:00:01:04,arp_tha=00:00:00:00:00:00
2025-06-18T05:32:33.781Z|466266|ovs_rcu(urcu4)|WARN|blocked 16001 ms
waiting for main to quiesce
2025-06-18T05:32:49.780Z|466267|ovs_rcu(urcu4)|WARN|blocked 32000 ms
waiting for main to quiesce
2025-06-18T05:33:06.782Z|434232|rconn|WARN|br-int<->unix#963254:
connection dropped (Broken pipe)
2025-06-18T05:33:06.783Z|434233|timeval|WARN|Unreasonably long 49013ms
poll interval (3146ms user, 45722ms system)

...
...
2025-06-18T05:32:33.781Z|466266|ovs_rcu(urcu4)|WARN|blocked 16001 ms
waiting for main to quiesce
2025-06-18T05:32:49.780Z|466267|ovs_rcu(urcu4)|WARN|blocked 32000 ms
waiting for main to quiesce
2025-06-18T05:33:06.782Z|434232|rconn|WARN|br-int<->unix#963254:
connection dropped (Broken pipe)
2025-06-18T05:33:06.783Z|434233|timeval|WARN|Unreasonably long 49013ms
poll interval (3146ms user, 45722ms system)
2025-06-18T05:33:06.783Z|434234|timeval|WARN|faults: 3 minor, 0 major
2025-06-18T05:33:06.783Z|434235|timeval|WARN|context switches: 2
voluntary, 38 involuntary
2025-06-18T05:33:06.783Z|434236|coverage|INFO|Skipping details of
duplicate event coverage for hash=81106859
2025-06-18T05:33:06.783Z|434237|poll_loop|INFO|wakeup due to [POLLIN]
on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:274 (99% CPU usage)
2025-06-18T05:33:06.783Z|434238|poll_loop|INFO|wakeup due to [POLLIN]
on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:157 (99% CPU usage)
2025-06-18T05:33:06.783Z|434239|poll_loop|INFO|wakeup due to [POLLOUT]
on fd 104 (/var/run/openvswitch/ovs-vswitchd.12738.ctl<->) at
../lib/stream-fd.c:153 (99% CPU usage)
2025-06-18T05:33:06.783Z|434240|poll_loop|INFO|wakeup due to
[POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:153 (99% CPU usage)
2025-06-18T05:33:06.783Z|434241|poll_loop|INFO|wakeup due to [POLLIN]
on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at
../lib/stream-fd.c:157 (99% CPU usage)
2025-06-18T05:33:06.787Z|12945|ofproto_dpif_upcall(handler35)|INFO|Dropped
49 log messages in last 49 seconds (most recently, 49 seconds ago) due
to excessive rate
2025-06-18T05:33:06.787Z|12946|ofproto_dpif_upcall(handler35)|INFO|received
packet on unassociated datapath port 7 (no OpenFlow port for datapath
port 7)
2025-06-18T05:33:06.791Z|434242|rconn|WARN|br-int<->unix#963255:
connection dropped (Broken pipe)

...

------------------------------------

We took core dump when its blocked,  and we see the below stack trace

#0 0x00007f48203205bf in recvmsg () from /lib64/libc.so.6 #1
0x000056195f0be300 in nl_sock_recv__ (sock=0x5619680045a0,
buf=buf@entry=0x7fff259e9070, nsid=nsid@entry=0x0,
wait=wait@entry=false) at ../lib/netlink-socket.c:712 #2
0x000056195f0c22d9 in nl_dump_refill (buffer=<optimized out>,
dump=<optimized out>) at ../lib/netlink-socket.c:1182 #3 nl_dump_next
(dump=dump@entry=0x7fff259e90b0, reply=reply@entry=0x7fff259e9030,
buffer=buffer@entry=0x7fff259e9070) at ../lib/netlink-socket.c:1265 #4
0x000056195f0c2a5e in nl_ct_flush_zone (flush_zone=2472) at
../lib/netlink-conntrack.c:312 #5 0x000056195ef2b970 in
handle_nxt_ct_flush_zone (oh=<optimized out>, ofconn=0x56197640d880)
at ../ofproto/ofproto.c:948 #6 handle_single_part_openflow
(type=OFPTYPE_CT_FLUSH_ZONE, oh=<optimized out>,
ofconn=0x56197640d880) at ../ofproto/ofproto.c:8870 #7 handle_openflow
(ofconn=0x56197640d880, msgs=0x7fff259e93f0) at
../ofproto/ofproto.c:8936 #8 0x000056195ef72106 in ofconn_run
(handle_openflow=0x56195ef2b480 <handle_openflow>,
ofconn=0x56197640d880) at ../ofproto/connmgr.c:1329 #9 connmgr_run
(mgr=0x5619680b10d0, handle_openflow=0x56195ef2b480 <handle_openflow>)
at ../ofproto/connmgr.c:356 #10 0x000056195ef278ee in ofproto_run
(p=<optimized out>) at ../ofproto/ofproto.c:1967 #11
0x000056195ef0c43c in bridge_run__ () at ../vswitchd/bridge.c:3241 #12
0x000056195ef1993e in bridge_run () at ../vswitchd/bridge.c:3300 #13
0x000056195e818de6 in main (argc=<optimized out>, argv=<optimized
out>) at ../vswitchd/ovs-vswitchd.c:130


Looks like something is wrong in the kernel.

Any pointers or suggestions would be great.

Thanks
Numan
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to