On 6/19/25 9:08 AM, Eelco Chaudron via dev wrote:
>
> On 19 Jun 2025, at 2:22, Numan Siddique wrote:
>
>> Hello OVS folks,
>>
>> We are seeing strange behavior with ovs-vswitchd after we upgraded
>> ovn-controller from 23.09 to 24.09.
>>
>> After ovn-controller is upgraded, ovs-vswitchd is blocked for a long
>> time, and because of that ovn-controller can't program the flows.
>> The compute node was hosting just a few VMs. Once the VMs were
>> migrated to another node, ovs-vswitchd and ovn-controller went back
>> to normal.
>>
>> Below are some of the log messages seen in ovs-vswitchd.log:
>>
>> -------------------------------------------------
>>
>> 2025-06-18T05:32:17.770Z|434224|timeval|WARN|Unreasonably long 59741ms poll interval (3942ms user, 55611ms system)
>> 2025-06-18T05:32:17.770Z|434225|timeval|WARN|faults: 29 minor, 0 major
>> 2025-06-18T05:32:17.770Z|434226|timeval|WARN|context switches: 0 voluntary, 95 involuntary
>> 2025-06-18T05:32:17.770Z|434227|coverage|INFO|Skipping details of duplicate event coverage for hash=81106859
>> 2025-06-18T05:32:17.770Z|434228|poll_loop|INFO|wakeup due to [POLLIN] on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:274 (99% CPU usage)
>> 2025-06-18T05:32:17.770Z|434229|poll_loop|INFO|wakeup due to [POLLIN] on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
>> 2025-06-18T05:32:17.770Z|434230|poll_loop|INFO|wakeup due to [POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:153 (99% CPU usage)
>> 2025-06-18T05:32:17.770Z|434231|poll_loop|INFO|wakeup due to [POLLIN] on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
>> 2025-06-18T05:32:17.774Z|11877|ofproto_dpif_upcall(handler28)|INFO|Dropped 45 log messages in last 60 seconds (most recently, 60 seconds ago) due to excessive rate
>> 2025-06-18T05:32:17.774Z|07880|ofproto_dpif_upcall(revalidator40)|WARN|Dropped 36 log messages in last 60 seconds (most recently, 60 seconds ago) due to excessive rate
>> 2025-06-18T05:32:17.774Z|11878|ofproto_dpif_upcall(handler28)|INFO|received packet on unassociated datapath port 7 (no OpenFlow port for datapath port 7)
>> 2025-06-18T05:32:17.774Z|07881|ofproto_dpif_upcall(revalidator40)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:d1e6759c-66bb-46ae-be16-2f6cbbbc2839
>> 2025-06-18T05:32:18.275Z|08136|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:cf11e102-9fc6-42b3-aadd-6e5a1a3603a8
>> 2025-06-18T05:32:18.275Z|08137|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:2e069616-631a-491c-ad14-c0460b3ccad2
>> 2025-06-18T05:32:18.275Z|08138|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ef4f40b5-625e-47ee-9d26-09b816b497bf
>> 2025-06-18T05:32:18.275Z|08139|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:f60d9bbc-4880-41c4-b799-e501b5d9f0a7
>> 2025-06-18T05:32:18.275Z|08140|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:aec47146-a9ec-42fa-97a9-6896017c865a
>> 2025-06-18T05:32:18.275Z|08141|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:e63cac09-4159-4187-8dff-64b432019bb3
>> 2025-06-18T05:32:18.275Z|08142|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:f6984063-81ad-432b-96d9-22dadd9f8edb
>> 2025-06-18T05:32:18.275Z|08143|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:609d6c3f-7093-4dba-bdb9-5607add1fb8a
>> 2025-06-18T05:32:18.275Z|08144|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:81887285-1a37-47d5-959a-85fa6a3547cc
>> 2025-06-18T05:32:18.780Z|466262|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for main to quiesce
>> 2025-06-18T05:32:19.779Z|466263|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for main to quiesce
>> 2025-06-18T05:32:21.779Z|466264|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for main to quiesce
>> 2025-06-18T05:32:25.780Z|466265|ovs_rcu(urcu4)|WARN|blocked 8001 ms waiting for main to quiesce
>> 2025-06-18T05:32:26.092Z|353399|ofproto_dpif_xlate(handler17)|WARN|Dropped 3 log messages in last 105 seconds (most recently, 86 seconds ago) due to excessive rate
>> 2025-06-18T05:32:26.092Z|353400|ofproto_dpif_xlate(handler17)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=4,dl_vlan=120,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:01:04,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=100.100.100.1,arp_tpa=100.100.100.1,arp_op=1,arp_sha=00:00:5e:00:01:04,arp_tha=00:00:00:00:00:00
>> 2025-06-18T05:32:33.781Z|466266|ovs_rcu(urcu4)|WARN|blocked 16001 ms waiting for main to quiesce
>> 2025-06-18T05:32:49.780Z|466267|ovs_rcu(urcu4)|WARN|blocked 32000 ms waiting for main to quiesce
>> 2025-06-18T05:33:06.782Z|434232|rconn|WARN|br-int<->unix#963254: connection dropped (Broken pipe)
>> 2025-06-18T05:33:06.783Z|434233|timeval|WARN|Unreasonably long 49013ms poll interval (3146ms user, 45722ms system)
>> 2025-06-18T05:33:06.783Z|434234|timeval|WARN|faults: 3 minor, 0 major
>> 2025-06-18T05:33:06.783Z|434235|timeval|WARN|context switches: 2 voluntary, 38 involuntary
>> 2025-06-18T05:33:06.783Z|434236|coverage|INFO|Skipping details of duplicate event coverage for hash=81106859
>> 2025-06-18T05:33:06.783Z|434237|poll_loop|INFO|wakeup due to [POLLIN] on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:274 (99% CPU usage)
>> 2025-06-18T05:33:06.783Z|434238|poll_loop|INFO|wakeup due to [POLLIN] on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
>> 2025-06-18T05:33:06.783Z|434239|poll_loop|INFO|wakeup due to [POLLOUT] on fd 104 (/var/run/openvswitch/ovs-vswitchd.12738.ctl<->) at ../lib/stream-fd.c:153 (99% CPU usage)
>> 2025-06-18T05:33:06.783Z|434240|poll_loop|INFO|wakeup due to [POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:153 (99% CPU usage)
>> 2025-06-18T05:33:06.783Z|434241|poll_loop|INFO|wakeup due to [POLLIN] on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
>> 2025-06-18T05:33:06.787Z|12945|ofproto_dpif_upcall(handler35)|INFO|Dropped 49 log messages in last 49 seconds (most recently, 49 seconds ago) due to excessive rate
>> 2025-06-18T05:33:06.787Z|12946|ofproto_dpif_upcall(handler35)|INFO|received packet on unassociated datapath port 7 (no OpenFlow port for datapath port 7)
>> 2025-06-18T05:33:06.791Z|434242|rconn|WARN|br-int<->unix#963255: connection dropped (Broken pipe)
>>
>> ...
>>
>> ------------------------------------
>>
>> We took a core dump while it was blocked, and we see the below stack
>> trace:
>>
>> #0  0x00007f48203205bf in recvmsg () from /lib64/libc.so.6
>> #1  0x000056195f0be300 in nl_sock_recv__ (sock=0x5619680045a0, buf=buf@entry=0x7fff259e9070, nsid=nsid@entry=0x0, wait=wait@entry=false) at ../lib/netlink-socket.c:712
>> #2  0x000056195f0c22d9 in nl_dump_refill (buffer=<optimized out>, dump=<optimized out>) at ../lib/netlink-socket.c:1182
>> #3  nl_dump_next (dump=dump@entry=0x7fff259e90b0, reply=reply@entry=0x7fff259e9030, buffer=buffer@entry=0x7fff259e9070) at ../lib/netlink-socket.c:1265
>> #4  0x000056195f0c2a5e in nl_ct_flush_zone (flush_zone=2472) at ../lib/netlink-conntrack.c:312
>> #5  0x000056195ef2b970 in handle_nxt_ct_flush_zone (oh=<optimized out>, ofconn=0x56197640d880) at ../ofproto/ofproto.c:948
>> #6  handle_single_part_openflow (type=OFPTYPE_CT_FLUSH_ZONE, oh=<optimized out>, ofconn=0x56197640d880) at ../ofproto/ofproto.c:8870
>> #7  handle_openflow (ofconn=0x56197640d880, msgs=0x7fff259e93f0) at ../ofproto/ofproto.c:8936
>> #8  0x000056195ef72106 in ofconn_run (handle_openflow=0x56195ef2b480 <handle_openflow>, ofconn=0x56197640d880) at ../ofproto/connmgr.c:1329
>> #9  connmgr_run (mgr=0x5619680b10d0, handle_openflow=0x56195ef2b480 <handle_openflow>) at ../ofproto/connmgr.c:356
>> #10 0x000056195ef278ee in ofproto_run (p=<optimized out>) at ../ofproto/ofproto.c:1967
>> #11 0x000056195ef0c43c in bridge_run__ () at ../vswitchd/bridge.c:3241
>> #12 0x000056195ef1993e in bridge_run () at ../vswitchd/bridge.c:3300
>> #13 0x000056195e818de6 in main (argc=<optimized out>, argv=<optimized out>) at ../vswitchd/ovs-vswitchd.c:130
>>
>> Looks like something is wrong in the kernel.
>>
>> Any pointers or suggestions would be great.
>
> This does not ring a bell. The question is: is it a single system call
> causing the delay, or many individual ones? It seems to be the OVS main
> thread. Maybe you can use the kernel delay script to get an idea of
> whether it's a lot of syscalls or just a single one.
>
> More details can be found here:
>
> - https://developers.redhat.com/articles/2024/01/31/debugging-open-vswitch-logs-long-poll-interval-blocked-waiting#getting_the_initial_data
> - https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame
>
> Maybe analyze_perf_pmd_syscall.py could also help to see the actual
> callback traces.
Yeah, that would be helpful. Looking at the trace, it seems like it's
dumping all the CT entries from the kernel while flushing the zone. How
many CT entries/zones do you have in this setup?

If this is the actual root cause, then upgrading both OVS and the kernel
may help, since Felix significantly improved the performance of this
operation:

https://github.com/openvswitch/ovs/commit/9ec849e8aa869b646c372fac552ae2609a4b5f66

You'll need OVS 3.4+ and kernel 6.8+.

Best regards,
Ilya Maximets.
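To put numbers on the entries/zones question, the table and zone sizes
can be checked on the affected node with something like the commands
below. This is a rough sketch: it assumes conntrack-tools is installed
and relies on conntrack printing a zone= field for entries in nonzero
zones; note that conntrack -L is itself a full-table dump and will be
slow on a large table. The kernel delay script Eelco refers to lives in
the OVS tree under utilities/usdt-scripts/kernel_delay.py and can help
confirm whether the time is spent in one long netlink dump (many
recvmsg() calls on one socket) rather than in many unrelated syscalls.

  # Total conntrack entries, straight from the kernel:
  cat /proc/sys/net/netfilter/nf_conntrack_count

  # The same via conntrack-tools:
  conntrack -C

  # Approximate number of distinct nonzero zones in use:
  conntrack -L 2>/dev/null | grep -o 'zone=[0-9]*' | sort -u | wc -l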
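For context on why the backtrace ends in recvmsg(): on kernels without
zone-filtered flushing, deleting one zone's entries means dumping the
entire conntrack table to userspace over netlink and doing the zone
match there, so the cost of each NXT_CT_FLUSH_ZONE request scales with
the total number of entries, not with the size of the zone being
flushed. Below is a minimal standalone C sketch of that dump-and-filter
pattern, written against libnetfilter_conntrack for brevity. It is an
illustration of the traffic pattern only, not the actual OVS code: OVS
speaks netlink directly in lib/netlink-conntrack.c, and as I understand
the commit referenced above, OVS 3.4+ on kernel 6.8+ turns the whole
operation into a single kernel-side filtered request instead.

/* zone_flush_sketch.c: flush one conntrack zone the slow way.
 * Illustrative only -- not the OVS implementation.
 * Build: gcc zone_flush_sketch.c -o zone_flush_sketch -lnetfilter_conntrack
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/socket.h>
#include <libnetfilter_conntrack/libnetfilter_conntrack.h>

static uint16_t target_zone;
static struct nfct_handle *del_handle;  /* separate socket for deletions */

/* Called once per conntrack entry: the kernel streams the WHOLE table
 * through this callback, and the zone check happens here in userspace. */
static int dump_cb(enum nf_conntrack_msg_type type,
                   struct nf_conntrack *ct, void *data)
{
    if (nfct_get_attr_u16(ct, ATTR_ZONE) == target_zone) {
        /* One more netlink round trip per matching entry. */
        nfct_query(del_handle, NFCT_Q_DESTROY, ct);
    }
    return NFCT_CB_CONTINUE;
}

int main(int argc, char **argv)
{
    uint32_t family = AF_UNSPEC;        /* dump entries of all families */

    target_zone = argc > 1 ? (uint16_t) atoi(argv[1]) : 0;

    struct nfct_handle *h = nfct_open(CONNTRACK, 0);
    del_handle = nfct_open(CONNTRACK, 0);
    if (!h || !del_handle) {
        perror("nfct_open");
        return 1;
    }

    nfct_callback_register(h, NFCT_T_ALL, dump_cb, NULL);

    /* The expensive part: a full-table dump, no matter how few entries
     * actually live in the target zone. */
    if (nfct_query(h, NFCT_Q_DUMP, &family) < 0) {
        perror("nfct_query(NFCT_Q_DUMP)");
    }

    nfct_close(del_handle);
    nfct_close(h);
    return 0;
}

Run as e.g. "./zone_flush_sketch 2472" (the zone number visible in frame
#4 of the backtrace). With millions of entries in the table and one such
flush issued per zone, it is easy to see how the main thread could sit
in recvmsg() for tens of seconds at a time.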