On Thu, Jun 19, 2025 at 5:12 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 6/19/25 9:08 AM, Eelco Chaudron via dev wrote:
> >
> >
> > On 19 Jun 2025, at 2:22, Numan Siddique wrote:
> >
> >> Hello OVS folks,
> >>
> >> We are seeing strange behavior with ovs-vswitchd after upgrading
> >> ovn-controller from 23.09 to 24.09.
> >>
> >> After ovn-controller is upgraded, ovs-vswitchd is blocked for a long
> >> time, and because of that ovn-controller can't program the flows.
> >> The compute node was hosting just a few VMs.  Once the VMs were
> >> migrated to another node, ovs-vswitchd and ovn-controller went back
> >> to normal.
> >>
> >> Below are some of the log messages seen in ovs-vswitchd.log.
> >>
> >> -------------------------------------------------
> >>
> >> 2025-06-18T05:32:17.770Z|434224|timeval|WARN|Unreasonably long 59741ms
> >> poll interval (3942ms user, 55611ms system)
> >> 2025-06-18T05:32:17.770Z|434225|timeval|WARN|faults: 29 minor, 0 major
> >> 2025-06-18T05:32:17.770Z|434226|timeval|WARN|context switches: 0
> >> voluntary, 95 involuntary
> >> 2025-06-18T05:32:17.770Z|434227|coverage|INFO|Skipping details of
> >> duplicate event coverage for hash=81106859
> >> 2025-06-18T05:32:17.770Z|434228|poll_loop|INFO|wakeup due to [POLLIN]
> >> on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:274 (99% CPU usage)
> >> 2025-06-18T05:32:17.770Z|434229|poll_loop|INFO|wakeup due to [POLLIN]
> >> on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:32:17.770Z|434230|poll_loop|INFO|wakeup due to
> >> [POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:153 (99% CPU usage)
> >> 2025-06-18T05:32:17.770Z|434231|poll_loop|INFO|wakeup due to [POLLIN]
> >> on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:32:17.774Z|11877|ofproto_dpif_upcall(handler28)|INFO|Dropped
> >> 45 log messages in last 60 seconds (most recently, 60 seconds ago) due
> >> to excessive rate
> >> 2025-06-18T05:32:17.774Z|07880|ofproto_dpif_upcall(revalidator40)|WARN|Dropped
> >> 36 log messages in last 60 seconds (most recently, 60 seconds ago) due
> >> to excessive rate
> >> 2025-06-18T05:32:17.774Z|11878|ofproto_dpif_upcall(handler28)|INFO|received
> >> packet on unassociated datapath port 7 (no OpenFlow port for datapath
> >> port 7)
> >> 2025-06-18T05:32:17.774Z|07881|ofproto_dpif_upcall(revalidator40)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:d1e6759c-66bb-46ae-be16-2f6cbbbc2839
> >> 2025-06-18T05:32:18.275Z|08136|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:cf11e102-9fc6-42b3-aadd-6e5a1a3603a8
> >> 2025-06-18T05:32:18.275Z|08137|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:2e069616-631a-491c-ad14-c0460b3ccad2
> >> 2025-06-18T05:32:18.275Z|08138|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:ef4f40b5-625e-47ee-9d26-09b816b497bf
> >> 2025-06-18T05:32:18.275Z|08139|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:f60d9bbc-4880-41c4-b799-e501b5d9f0a7
> >> 2025-06-18T05:32:18.275Z|08140|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:aec47146-a9ec-42fa-97a9-6896017c865a
> >> 2025-06-18T05:32:18.275Z|08141|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:e63cac09-4159-4187-8dff-64b432019bb3
> >> 2025-06-18T05:32:18.275Z|08142|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:f6984063-81ad-432b-96d9-22dadd9f8edb
> >> 2025-06-18T05:32:18.275Z|08143|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:609d6c3f-7093-4dba-bdb9-5607add1fb8a
> >> 2025-06-18T05:32:18.275Z|08144|ofproto_dpif_upcall(revalidator46)|WARN|Failed
> >> to acquire udpif_key corresponding to unexpected flow (Invalid
> >> argument): ufid:81887285-1a37-47d5-959a-85fa6a3547cc
> >> 2025-06-18T05:32:18.780Z|466262|ovs_rcu(urcu4)|WARN|blocked 1000 ms
> >> waiting for main to quiesce
> >> 2025-06-18T05:32:19.779Z|466263|ovs_rcu(urcu4)|WARN|blocked 2000 ms
> >> waiting for main to quiesce
> >> 2025-06-18T05:32:21.779Z|466264|ovs_rcu(urcu4)|WARN|blocked 4000 ms
> >> waiting for main to quiesce
> >> 2025-06-18T05:32:25.780Z|466265|ovs_rcu(urcu4)|WARN|blocked 8001 ms
> >> waiting for main to quiesce
> >> 2025-06-18T05:32:26.092Z|353399|ofproto_dpif_xlate(handler17)|WARN|Dropped
> >> 3 log messages in last 105 seconds (most recently, 86 seconds ago) due
> >> to excessive rate
> >> 2025-06-18T05:32:26.092Z|353400|ofproto_dpif_xlate(handler17)|WARN|over
> >> 4096 resubmit actions on bridge br-int while processing
> >> arp,in_port=4,dl_vlan=120,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:01:04,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=100.100.100.1,arp_tpa=100.100.100.1,arp_op=1,arp_sha=00:00:5e:00:01:04,arp_tha=00:00:00:00:00:00
> >> 2025-06-18T05:32:33.781Z|466266|ovs_rcu(urcu4)|WARN|blocked 16001 ms
> >> waiting for main to quiesce
> >> 2025-06-18T05:32:49.780Z|466267|ovs_rcu(urcu4)|WARN|blocked 32000 ms
> >> waiting for main to quiesce
> >> 2025-06-18T05:33:06.782Z|434232|rconn|WARN|br-int<->unix#963254:
> >> connection dropped (Broken pipe)
> >> 2025-06-18T05:33:06.783Z|434233|timeval|WARN|Unreasonably long 49013ms
> >> poll interval (3146ms user, 45722ms system)
> >> 2025-06-18T05:33:06.783Z|434234|timeval|WARN|faults: 3 minor, 0 major
> >> 2025-06-18T05:33:06.783Z|434235|timeval|WARN|context switches: 2
> >> voluntary, 38 involuntary
> >> 2025-06-18T05:33:06.783Z|434236|coverage|INFO|Skipping details of
> >> duplicate event coverage for hash=81106859
> >> 2025-06-18T05:33:06.783Z|434237|poll_loop|INFO|wakeup due to [POLLIN]
> >> on fd 91 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:274 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434238|poll_loop|INFO|wakeup due to [POLLIN]
> >> on fd 95 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434239|poll_loop|INFO|wakeup due to [POLLOUT]
> >> on fd 104 (/var/run/openvswitch/ovs-vswitchd.12738.ctl<->) at
> >> ../lib/stream-fd.c:153 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434240|poll_loop|INFO|wakeup due to
> >> [POLLOUT][POLLHUP] on fd 7 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:153 (99% CPU usage)
> >> 2025-06-18T05:33:06.783Z|434241|poll_loop|INFO|wakeup due to [POLLIN]
> >> on fd 107 (/var/run/openvswitch/br-int.mgmt<->) at
> >> ../lib/stream-fd.c:157 (99% CPU usage)
> >> 2025-06-18T05:33:06.787Z|12945|ofproto_dpif_upcall(handler35)|INFO|Dropped
> >> 49 log messages in last 49 seconds (most recently, 49 seconds ago) due
> >> to excessive rate
> >> 2025-06-18T05:33:06.787Z|12946|ofproto_dpif_upcall(handler35)|INFO|received
> >> packet on unassociated datapath port 7 (no OpenFlow port for datapath
> >> port 7)
> >> 2025-06-18T05:33:06.791Z|434242|rconn|WARN|br-int<->unix#963255:
> >> connection dropped (Broken pipe)
> >>
> >> ...
> >>
> >> ------------------------------------
> >>
> >> We took a core dump while it was blocked, and we see the stack trace below:
> >>
> >> #0  0x00007f48203205bf in recvmsg () from /lib64/libc.so.6
> >> #1  0x000056195f0be300 in nl_sock_recv__ (sock=0x5619680045a0,
> >>      buf=buf@entry=0x7fff259e9070, nsid=nsid@entry=0x0,
> >>      wait=wait@entry=false) at ../lib/netlink-socket.c:712
> >> #2  0x000056195f0c22d9 in nl_dump_refill (buffer=<optimized out>,
> >>      dump=<optimized out>) at ../lib/netlink-socket.c:1182
> >> #3  nl_dump_next (dump=dump@entry=0x7fff259e90b0,
> >>      reply=reply@entry=0x7fff259e9030, buffer=buffer@entry=0x7fff259e9070)
> >>      at ../lib/netlink-socket.c:1265
> >> #4  0x000056195f0c2a5e in nl_ct_flush_zone (flush_zone=2472)
> >>      at ../lib/netlink-conntrack.c:312
> >> #5  0x000056195ef2b970 in handle_nxt_ct_flush_zone (oh=<optimized out>,
> >>      ofconn=0x56197640d880) at ../ofproto/ofproto.c:948
> >> #6  handle_single_part_openflow (type=OFPTYPE_CT_FLUSH_ZONE,
> >>      oh=<optimized out>, ofconn=0x56197640d880) at ../ofproto/ofproto.c:8870
> >> #7  handle_openflow (ofconn=0x56197640d880, msgs=0x7fff259e93f0)
> >>      at ../ofproto/ofproto.c:8936
> >> #8  0x000056195ef72106 in ofconn_run
> >>      (handle_openflow=0x56195ef2b480 <handle_openflow>,
> >>      ofconn=0x56197640d880) at ../ofproto/connmgr.c:1329
> >> #9  connmgr_run (mgr=0x5619680b10d0,
> >>      handle_openflow=0x56195ef2b480 <handle_openflow>) at ../ofproto/connmgr.c:356
> >> #10 0x000056195ef278ee in ofproto_run (p=<optimized out>)
> >>      at ../ofproto/ofproto.c:1967
> >> #11 0x000056195ef0c43c in bridge_run__ () at ../vswitchd/bridge.c:3241
> >> #12 0x000056195ef1993e in bridge_run () at ../vswitchd/bridge.c:3300
> >> #13 0x000056195e818de6 in main (argc=<optimized out>, argv=<optimized out>)
> >>      at ../vswitchd/ovs-vswitchd.c:130
> >>
> >>
> >> Looks like something is wrong in the kernel.
> >>
> >> Any pointers or suggestions would be great.
> >
> > This does not ring a bell.  It seems to be the OVS main thread that is
> > blocked, so the question is whether the delay comes from a single system
> > call or from many individual ones.  Maybe you can use the kernel delay
> > script to find out.
> >
> > More details can be found here:
> >
> > - https://developers.redhat.com/articles/2024/01/31/debugging-open-vswitch-logs-long-poll-interval-blocked-waiting#getting_the_initial_data
> > - https://developers.redhat.com/articles/2023/07/24/troubleshooting-open-vswitch-kernel-blame
> >
> > Maybe analyze_perf_pmd_syscall.py could also help to see the actual 
> > callback traces.
>
> Yeah, that would be helpful.
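
As a quick first check before wiring up the scripts above, attaching strace
to ovs-vswitchd should already show whether the time is going into one long
recvmsg() call or into many of them.  A minimal sketch (the syscall filter
and the use of pidof, which assumes a single ovs-vswitchd process, are just
illustrative; strace adds overhead, so keep the capture short):

  # Time each recvmsg/sendmsg syscall across all ovs-vswitchd threads;
  # -T prints the time spent inside every call.
  strace -f -T -e trace=recvmsg,sendmsg -p $(pidof ovs-vswitchd)

  # Or aggregate call counts and total time per syscall for ~30 seconds:
  timeout 30 strace -c -f -p $(pidof ovs-vswitchd)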
>
>
> Looking at the trace, it seems like we are dumping all the CT entries from
> the kernel while flushing the zone.  How many CT entries/zones do you have
> in this setup?
>
> If this is the actual root cause, then upgrading both OVS and the kernel
> may help, since Felix significantly improved the performance of this operation:
> https://github.com/openvswitch/ovs/commit/9ec849e8aa869b646c372fac552ae2609a4b5f66
>
> You'll need OVS 3.4+ and kernel 6.8+.
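
For anyone checking whether a given node already has that fix, the relevant
versions are easy to confirm (the commit to look for is the one linked above):

  ovs-vswitchd --version   # should report 3.4 or newer
  uname -r                 # should report 6.8 or newer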

Thanks Eelco and Ilya.

This is really helpful.  There are a lot of CT entries, around 90K for a
single zone, so I suspect we are hitting the same issue.
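
(In case it is useful to anyone else hitting this: a rough way to get those
counts on a node using the kernel datapath is something like the following;
the zone number is just the one visible in the backtrace above.)

  # Total conntrack entries tracked by the kernel on this node:
  cat /proc/sys/net/netfilter/nf_conntrack_count

  # Entries in one specific CT zone, dumped through OVS:
  ovs-appctl dpctl/dump-conntrack zone=2472 | wc -l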

Thanks
Numan

>
> Best regards, Ilya Maximets.