On Tue, Jul 22, 2025 at 08:46:37AM -0400, Aaron Conole wrote:
> Adrián Moreno via discuss <ovs-discuss@openvswitch.org> writes:
>
> > On Tue, Apr 29, 2025 at 01:48:42PM +0800, chenyongchang--- via discuss wrote:
> >>
> >> Hello,
> >> In a high-traffic scenario, when modifying the bond-rebalance-interval
> >> configuration for an OVS-DPDK bond interface,
> >> we observed that OVS-DPDK generated USERSPACE_INVALID_PORT_DROP errors.
> >>
> >> After analysis, executing the command ovs-vsctl set port dpdk_tun_port
> >> other_config:bond-rebalance-interval=1000
> >> triggered the following process, ultimately leading to the
> >> USERSPACE_INVALID_PORT_DROP errors:
> >>
> >> 1. Execution of memset(bond->hash, 0, hash_len);
> >> Call stack:
> >> #0 bond_entry_reset (bond=0x4c64bc0) at ofproto/bond.c:1852
> >> #1 0x0000000001a2a238 in bond_reconfigure (bond=0x4c64bc0,
> >> s=0x7fff6d1dec10) at ofproto/bond.c:514
> >> #2 0x0000000001a4e253 in bundle_set (ofproto_=0x4c21110,
> >> aux=0x4c39d90, s=0x7fff6d1deb90) at ofproto/ofproto-dpif.c:3484
> >> #3 0x0000000001a31b27 in ofproto_bundle_register (ofproto=0x4c21110,
> >> aux=0x4c39d90, s=0x7fff6d1deb90) at ofproto/ofproto.c:1430
> >> #4 0x0000000001a1c80e in port_configure (port=0x4c39d90) at
> >> vswitchd/bridge.c:1384
> >> #5 0x0000000001a1b7b3 in bridge_reconfigure (ovs_cfg=0x4bb37c0) at
> >> vswitchd/bridge.c:1005
> >> #6 0x0000000001a223e7 in bridge_run () at vswitchd/bridge.c:3423
> >> #7 0x0000000001a27b9e in main (argc=11, argv=0x7fff6d1def38) at
> >> vswitchd/ovs-vswitchd.c:129
> >>
> >> 2. Execution of member_map[i] = OFPP_NONE
> >> Call stack:
> >> #0 bond_add_lb_output_buckets (bond=0x37220f0) at ofproto/bond.c:2135
> >> #1 0x0000000001a29b4f in update_recirc_rules__ (bond=0x37220f0) at
> >> ofproto/bond.c:356
> >> #2 0x0000000001a29ebe in update_recirc_rules (bond=0x37220f0) at
> >> ofproto/bond.c:426
> >> #3 0x0000000001a2a262 in bond_reconfigure (bond=0x37220f0,
> >> s=0x7fffffffe230) at ofproto/bond.c:520
> >> #4 0x0000000001a4e292 in bundle_set (ofproto_=0x366afa0,
> >> aux=0x3713290, s=0x7fffffffe1b0) at ofproto/ofproto-dpif.c:3484
> >> #5 0x0000000001a31b66 in ofproto_bundle_register (ofproto=0x366afa0,
> >> aux=0x3713290, s=0x7fffffffe1b0) at ofproto/ofproto.c:1430
> >> #6 0x0000000001a1c80e in port_configure (port=0x3713290) at
> >> vswitchd/bridge.c:1384
> >> #7 0x0000000001a1b7b3 in bridge_reconfigure (ovs_cfg=0x3660180) at
> >> vswitchd/bridge.c:1005
> >> #8 0x0000000001a223b7 in bridge_run () at vswitchd/bridge.c:3422
> >> #9 0x0000000001a27b92 in main (argc=1, argv=0x7fffffffe558)
> >>
> >> 3. PMD thread sending packets found port_no=0xffffffff
> >> Call stack:
> >> #0 dp_execute_output_action (pmd=0x7fff68731010, packets_=0x7fff53ff8f50,
> >> should_steal=true, port_no=4294967295)
> >> at lib/dpif-netdev.c:9273
> >> #1 0x0000000001acaf6d in dp_execute_lb_output_action
> >> (pmd=0x7fff68731010, packets_=0x7fff53ff9ca0, should_steal=true,
> >> bond=1) at lib/dpif-netdev.c:9350
> >> #2 0x0000000001acb0b6 in dp_execute_cb (aux_=0x7fff53ff9b30,
> >> packets_=0x7fff53ff9ca0, a=0x7fff4800f074, should_steal=true)
> >> at lib/dpif-netdev.c:9379
> >> #3 0x0000000001b526b5 in odp_execute_actions (dp=0x7fff53ff9b30,
> >> batch=0x7fff53ff9ca0, steal=true,
> >> actions=0x7fff4800f074, actions_len=8, dp_execute_action=0x1acafc0
> >> <dp_execute_cb>) at lib/odp-execute.c:1016
> >> #4 0x0000000001acbc8e in dp_netdev_execute_actions (pmd=0x7fff68731010,
> >> packets=0x7fff53ff9ca0, should_steal=true,
> >> flow=0x7fff4800ea70, actions=0x7fff4800f074, actions_len=8) at
> >> lib/dpif-netdev.c:9698
> >> #5 0x0000000001ac8133 in packet_batch_per_flow_execute
> >> (batch=0x7fff53ff9c90, pmd=0x7fff68731010)
> >> at lib/dpif-netdev.c:8338
> >> #6 0x0000000001aca3ad in dp_netdev_input__ (pmd=0x7fff68731010,
> >> packets=0x7fff53ffbdf0, md_is_valid=false, port_no=4)
> >> at lib/dpif-netdev.c:9055
> >> #7 0x0000000001aca3ff in dp_netdev_input (pmd=0x7fff68731010,
> >> packets=0x7fff53ffbdf0, port_no=4) at lib/dpif-netdev.c:9064
> >> #8 0x0000000001ac0da2 in dp_netdev_process_rxq_port (pmd=0x7fff68731010,
> >> rxq=0x3720220, port_no=4)
> >> at lib/dpif-netdev.c:5690
> >> #9 0x0000000001ac566a in pmd_thread_main (f_=0x7fff68731010) at
> >> lib/dpif-netdev.c:7334
> >> #10 0x0000000001bc4b1b in ovsthread_wrapper (aux_=0x3711920) at
> >> lib/ovs-thread.c:422
> >> #11 0x00007ffff76f4802 in start_thread () from /lib64/libc.so.6
> >> --Type <RET> for more, q to quit, c to continue without paging--
> >> #12 0x00007ffff7694314 in clone () from /lib64/libc.so.6
> >>
> >> The main issue arises from a timing discrepancy between the main
> >> thread and the PMD thread when operating on pmd->tx_bonds,
> >> which causes the PMD to temporarily resolve the egress interface to
> >> 0xffffffff (an invalid value).
> >> What solutions does the community propose to address this problem?
> >
> > Reconfiguring the bonding flows for a simple change in the
> > rebalance_interval seems an overkill. It was added so that users could
> > disable rebalancing but just increasing or decreasing the interval
> > (without initial or final values being zero) should not trigger a bond
> > reset.
>
> You mean by doing something like:
>
> diff --git a/ofproto/bond.c b/ofproto/bond.c
> index 3859ddca08..86e21607e5 100644
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -459,8 +459,14 @@ bond_reconfigure(struct bond *bond, const struct bond_settings *s)
>      }
>
>      if (bond->rebalance_interval != s->rebalance_interval) {
> +        /* Recompute the next rebalance interval by moving the next_rebalance
> +         * to be offset by the new interval.  Then let the rebalance code
> +         * trigger a rebalance based on the new details.  In this case, if
> +         * all that was updated is the rebalance interval, we can skip
> +         * triggering the rest of the port reconfigure mechanism. */
> +        int old_start_time = bond->next_rebalance - bond->rebalance_interval;
>          bond->rebalance_interval = s->rebalance_interval;
> -        revalidate = true;
> +        bond->next_rebalance = old_start_time + bond->rebalance_interval;
>      }
>
>      if (bond->balance != s->balance) {
>
> > Also, if bond_reconfigure resets the bond hashes, we should probably not
> > wait until bond_run() calls bond_update_post_recirc_rules__() to initialize
> > them. Even for recirc-driven bonds, this makes an initial update of
> > post-recirc rules with all hashes being zero.
>
> This might require a bit deeper look, and I consider it a bit different
> anyway. The above patch should at least allow for updating the
> rebalance interval, but it keeps the idea of deferring the work until
> the actual bond run has been invoked. NOTE: I didn't test the above
> change in any way. I need to recheck if the reconfigure code is run
> inline with the rest of the bond code for the dpif - if not the
> bond->next_rebalance is unsafe. Consider this just illustration.
>
> > I'll take a deeper look probably next week.
> >
> >>
> >> our ovs version 2.17.5 lts.
> >
> > Note 2.17.5 is not EOL.
>
> 2.17.11 is the most recent release for 2.17, but our current LTS is 3.3,
> and the most recent release is 3.3.5 - so it would be a good idea to
> move forward.

Hehe, I wanted to say that but a "not" got in the way :-)

Thanks.
Adrián
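
For illustration only, the timer arithmetic in the untested patch quoted above can be tried out in isolation. The sketch below is a standalone approximation, not OVS code: struct bond_timer and both function names are hypothetical stand-ins for the two fields of struct bond that the patch touches, times are assumed to be in milliseconds, and it assumes next_rebalance was previously set to "window start + interval". It uses long long int for the intermediate value, which is an assumption of this sketch (the quoted patch uses int).

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the two timing fields of
 * struct bond that the quoted patch touches.  Times are in ms. */
struct bond_timer {
    long long int next_rebalance;   /* Absolute time of the next rebalance. */
    int rebalance_interval;         /* Interval between rebalances; 0 = off. */
};

/* Mirrors the arithmetic of the quoted patch: recover the start of the
 * current rebalance window, then re-derive the deadline from the new
 * interval.  Nothing else is reset or revalidated. */
void
bond_timer_set_interval(struct bond_timer *t, int new_interval)
{
    assert(new_interval >= 0);
    if (t->rebalance_interval != new_interval) {
        long long int window_start = t->next_rebalance
                                     - t->rebalance_interval;
        t->rebalance_interval = new_interval;
        t->next_rebalance = window_start + new_interval;
    }
}

/* Returns 1 if a rebalance would be due at time 'now', 0 otherwise.
 * Treating interval 0 as "disabled" is an assumption of this sketch. */
int
bond_timer_due(const struct bond_timer *t, long long int now)
{
    return t->rebalance_interval > 0 && now >= t->next_rebalance;
}
```

With a window that started at 5000 ms and an interval of 10000 ms, next_rebalance is 15000. Changing the interval to 1000 moves next_rebalance to 6000, so if the current time is already past that point the next run would see a rebalance as due right away. The transitions to or from zero mentioned earlier in the thread are not modelled here.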