On Thu, Jun 19, 2025 at 5:12 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 6/19/25 9:08 AM, Eelco Chaudron via dev wrote:
> >
> > On 19 Jun 2025, at 2:22, Numan Siddique wrote:
> >
> >> Hello OVS folks,
> >>
> >> We are seeing strange behavior with ovs-vswitchd after we upgraded
> >> ovn-controller from 23.09 to 24.09.
> >>
> >> After ovn-controller is upgraded, ovs-vswitchd is blocked for a long
> >> time, because of which ovn-controller can't program the flows.
> >> The compute node was hosting just a few VMs. Once the VMs were
> >> migrated to another node, ovs-vswitchd and ovn-controller went back to normal.
> >>
> >> Below are some of the log messages seen in ovs-vswitchd.log:
> >>
> >> -------------------------------------------------
> >>
> >> 2025-06-18T05:32:17.770Z|434224|timeval|WARN|Unreasonably long 59741ms poll interval (3942ms user, 55611ms system)
> >> 2025-06-18T05:32:17.770Z|434225|timeval|WARN|faults: 29 minor, 0 major
> >> 2025-06-18T05:32:17.770Z|434226|timeval|WARN|context switches: 0 voluntary, 95 involuntary
> >> 2025-06-18T05:32:17.770Z|434227|coverage|INFO|Skipping details of duplicate event coverage for hash=81106859
> >> 2025-06-18T05:32:17.770Z|434228|poll_loop|INFO|wakeup due to [POLLIN] on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:274 (99% CPU usage)
> >> 2025-06-18T05:32:17.770Z|434229|poll_loop|INFO|wakeup due to [POLLIN] on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:32:17.770Z|434230|poll_loop|INFO|wakeup due to [POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:153 (99% CPU usage)
> >> 2025-06-18T05:32:17.770Z|434231|poll_loop|INFO|wakeup due to [POLLIN] on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:32:17.774Z|11877|ofproto_dpif_upcall(handler28)|INFO|Dropped 45 log messages in last 60 seconds (most recently, 60 seconds ago) due to excessive rate
> >> 2025-06-18T05:32:17.774Z|07880|ofproto_dpif_upcall(revalidator40)|WARN|Dropped 36 log messages in last 60 seconds (most recently, 60 seconds ago) due to excessive rate
> >> 2025-06-18T05:32:17.774Z|11878|ofproto_dpif_upcall(handler28)|INFO|received packet on unassociated datapath port 7 (no OpenFlow port for datapath port 7)
> >> 2025-06-18T05:32:17.774Z|07881|ofproto_dpif_upcall(revalidator40)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:d1e6759c-66bb-46ae-be16-2f6cbbbc2839
> >> 2025-06-18T05:32:18.275Z|08136|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:cf11e102-9fc6-42b3-aadd-6e5a1a3603a8
> >> 2025-06-18T05:32:18.275Z|08137|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:2e069616-631a-491c-ad14-c0460b3ccad2
> >> 2025-06-18T05:32:18.275Z|08138|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ef4f40b5-625e-47ee-9d26-09b816b497bf
> >> 2025-06-18T05:32:18.275Z|08139|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:f60d9bbc-4880-41c4-b799-e501b5d9f0a7
> >> 2025-06-18T05:32:18.275Z|08140|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:aec47146-a9ec-42fa-97a9-6896017c865a
> >> 2025-06-18T05:32:18.275Z|08141|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:e63cac09-4159-4187-8dff-64b432019bb3
> >> 2025-06-18T05:32:18.275Z|08142|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:f6984063-81ad-432b-96d9-22dadd9f8edb
> >> 2025-06-18T05:32:18.275Z|08143|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:609d6c3f-7093-4dba-bdb9-5607add1fb8a
> >> 2025-06-18T05:32:18.275Z|08144|ofproto_dpif_upcall(revalidator46)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:81887285-1a37-47d5-959a-85fa6a3547cc
> >> 2025-06-18T05:32:18.780Z|466262|ovs_rcu(urcu4)|WARN|blocked 1000 ms waiting for main to quiesce
> >> 2025-06-18T05:32:19.779Z|466263|ovs_rcu(urcu4)|WARN|blocked 2000 ms waiting for main to quiesce
> >> 2025-06-18T05:32:21.779Z|466264|ovs_rcu(urcu4)|WARN|blocked 4000 ms waiting for main to quiesce
> >> 2025-06-18T05:32:25.780Z|466265|ovs_rcu(urcu4)|WARN|blocked 8001 ms waiting for main to quiesce
> >> 2025-06-18T05:32:26.092Z|353399|ofproto_dpif_xlate(handler17)|WARN|Dropped 3 log messages in last 105 seconds (most recently, 86 seconds ago) due to excessive rate
> >> 2025-06-18T05:32:26.092Z|353400|ofproto_dpif_xlate(handler17)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=4,dl_vlan=120,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:01:04,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=100.100.100.1,arp_tpa=100.100.100.1,arp_op=1,arp_sha=00:00:5e:00:01:04,arp_tha=00:00:00:00:00:00
> >> 2025-06-18T05:32:33.781Z|466266|ovs_rcu(urcu4)|WARN|blocked 16001 ms waiting for main to quiesce
> >> 2025-06-18T05:32:49.780Z|466267|ovs_rcu(urcu4)|WARN|blocked 32000 ms waiting for main to quiesce
> >> 2025-06-18T05:33:06.782Z|434232|rconn|WARN|br-int<->unix#963254: connection dropped (Broken pipe)
> >> 2025-06-18T05:33:06.783Z|434233|timeval|WARN|Unreasonably long 49013ms poll interval (3146ms user, 45722ms system)
> >> 2025-06-18T05:33:06.783Z|434234|timeval|WARN|faults: 3 minor, 0 major
> >> 2025-06-18T05:33:06.783Z|434235|timeval|WARN|context switches: 2 voluntary, 38 involuntary
> >> 2025-06-18T05:33:06.783Z|434236|coverage|INFO|Skipping details of duplicate event coverage for hash=81106859
> >> 2025-06-18T05:33:06.783Z|434237|poll_loop|INFO|wakeup due to [POLLIN] on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:274 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434238|poll_loop|INFO|wakeup due to [POLLIN] on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434239|poll_loop|INFO|wakeup due to [POLLOUT] on fd 104 (/var/run/openvswitch/ovs-vswitchd.12738.ctl<->) at ../lib/stream-fd.c:153 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434240|poll_loop|INFO|wakeup due to [POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:153 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434241|poll_loop|INFO|wakeup due to [POLLIN] on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:33:06.787Z|12945|ofproto_dpif_upcall(handler35)|INFO|Dropped 49 log messages in last 49 seconds (most recently, 49 seconds ago) due to excessive rate
> >> 2025-06-18T05:33:06.787Z|12946|ofproto_dpif_upcall(handler35)|INFO|received packet on unassociated datapath port 7 (no OpenFlow port for datapath port 7)
> >> 2025-06-18T05:33:06.791Z|434242|rconn|WARN|br-int<->unix#963255: connection dropped (Broken pipe)
> >>
> >> ...
> >>
> >> ------------------------------------
> >>
> >> We took a core dump while it was blocked, and we see the stack trace below:
> >>
> >> #0  0x00007f48203205bf in recvmsg () from /lib64/libc.so.6
> >> #1  0x000056195f0be300 in nl_sock_recv__ (sock=0x5619680045a0, buf=buf@entry=0x7fff259e9070, nsid=nsid@entry=0x0, wait=wait@entry=false) at ../lib/netlink-socket.c:712
> >> #2  0x000056195f0c22d9 in nl_dump_refill (buffer=<optimized out>, dump=<optimized out>) at ../lib/netlink-socket.c:1182
> >> #3  nl_dump_next (dump=dump@entry=0x7fff259e90b0, reply=reply@entry=0x7fff259e9030, buffer=buffer@entry=0x7fff259e9070) at ../lib/netlink-socket.c:1265
> >> #4  0x000056195f0c2a5e in nl_ct_flush_zone (flush_zone=2472) at ../lib/netlink-conntrack.c:312
> >> #5  0x000056195ef2b970 in handle_nxt_ct_flush_zone (oh=<optimized out>, ofconn=0x56197640d880) at ../ofproto/ofproto.c:948
> >> #6  handle_single_part_openflow (type=OFPTYPE_CT_FLUSH_ZONE, oh=<optimized out>, ofconn=0x56197640d880) at ../ofproto/ofproto.c:8870
> >> #7  handle_openflow (ofconn=0x56197640d880, msgs=0x7fff259e93f0) at ../ofproto/ofproto.c:8936
> >> #8  0x000056195ef72106 in ofconn_run (handle_openflow=0x56195ef2b480 <handle_openflow>, ofconn=0x56197640d880) at ../ofproto/connmgr.c:1329
> >> #9  connmgr_run (mgr=0x5619680b10d0, handle_openflow=0x56195ef2b480 <handle_openflow>) at ../ofproto/connmgr.c:356
> >> #10 0x000056195ef278ee in ofproto_run (p=<optimized out>) at ../ofproto/ofproto.c:1967
> >> #11 0x000056195ef0c43c in bridge_run__ () at ../vswitchd/bridge.c:3241
> >> #12 0x000056195ef1993e in bridge_run () at ../vswitchd/bridge.c:3300
> >> #13 0x000056195e818de6 in main (argc=<optimized out>, argv=<optimized out>) at ../vswitchd/ovs-vswitchd.c:130
> >>
> >> Looks like something is wrong in the kernel.
> >>
> >> Any pointers or suggestions would be great.
> >
> > This does not ring a bell. The question is: is it a single system call causing the delay, or many individual ones? It seems to be the OVS main thread. Maybe you can use the kernel delay script to get an idea of whether it's a lot of syscalls or just a single one.
> >
> > More details can be found here:
> >
> > - https://developers.redhat.com/articles/2024/01/31/debugging-open-vswitch-logs-long-poll-interval-blocked-waiting#getting_the_initial_data
> > - https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame
> >
> > Maybe analyze_perf_pmd_syscall.py could also help to see the actual callback traces.
>
> Yeah, that would be helpful.
>
> Looking at the trace, it seems like it is dumping all the CT entries from the kernel while flushing the zone. How many CT entries/zones do you have in this setup?
>
> If this is the actual root cause, then upgrading both OVS and the kernel may help, since Felix significantly improved the performance of this operation:
>
> https://github.com/openvswitch/ovs/commit/9ec849e8aa869b646c372fac552ae2609a4b5f66
>
> You'll need OVS 3.4+ and kernel 6.8+.
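To get the per-zone counts Ilya asks about, one option is to post-process the datapath conntrack dump. The sketch below is only an illustration, not a definitive tool: it shells out to `ovs-appctl dpctl/dump-conntrack` and assumes each dumped entry is printed on one line with an optional `zone=<N>` field (entries without one are counted as zone 0); the exact output format may differ between OVS versions.

#!/usr/bin/env python3
# Rough per-zone conntrack entry count.  Assumes "ovs-appctl
# dpctl/dump-conntrack" prints one entry per line with an optional
# "zone=<N>" field; entries with no zone field are counted as zone 0.
import re
import subprocess
from collections import Counter

def count_entries_per_zone():
    # Dumping the full table can itself be slow when it holds
    # hundreds of thousands of entries.
    out = subprocess.run(
        ["ovs-appctl", "dpctl/dump-conntrack"],
        capture_output=True, text=True, check=True,
    ).stdout

    zones = Counter()
    for line in out.splitlines():
        m = re.search(r"\bzone=(\d+)", line)
        zones[int(m.group(1)) if m else 0] += 1
    return zones

if __name__ == "__main__":
    for zone, count in count_entries_per_zone().most_common(20):
        print(f"zone {zone}: {count} entries")

Sorting by count makes the heavily loaded zones, such as the ~90K one reported below, stand out immediately.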
Thanks Eelco and Ilya. This is really helpful.

There are a lot of CT entries, around 90K in one zone, so I suspect we are hitting the same issue.

Thanks
Numan

> Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev