Jiayuan Chen <[email protected]> wrote:

>March 2, 2026 at 16:10, "Sebastian Andrzej Siewior" <[email protected]> wrote:
>
>
>> 
>> On 2026-02-28 03:36:24 [+0000], Jiayuan Chen wrote:
>> 
>> > 
>> > My only concern is that this will waste a percpu u32 per bond
>> >  device for the majority of bonding use cases (which use modes other than
>> >  balance-rr), which could be a few hundred bytes on a large machine.
>> >  
>> >  Does everything work reliably if the rr_tx_counter allocation
>> >  happens conditionally on mode == BOND_MODE_ROUNDROBIN in bond_setup, as
>> >  well as in bond_option_mode_set?
>> > 
>> …
>> 
>> > 
>> > An alternative would be to allocate conditionally in bond_init() (since 
>> > the default mode is round-robin)
>> >  and manage allocation/deallocation in bond_option_mode_set() when the 
>> > mode changes.
>> > 
>> This sounds reasonable.
>> 
>> > 
>> > This is a trade-off between the added complexity of conditional alloc/free 
>> > across multiple code
>> >  paths and saving a per-CPU u32 for non-round-robin bonds.
>> >  
>> >  For the per-CPU u32 overhead, it's only 4 extra bytes per CPU per bond 
>> > device — and machines with
>> >  that many CPUs tend to have plenty of memory to match.
>> > 
>> 4 bytes is the minimum allocation size for per-CPU memory. The memory is
>> already "there", it is just not yet assigned, so a 4 byte allocation only
>> needs to find a single free area (the smallest allocation size).
>> Only if there is no free block does a new block get allocated and mapped
>> for each CPU, which is the part that actually costs memory.
>> That said, we should not waste memory, but it is not _that_ expensive
>> either for a single bond device. Things change if there are hundreds of
>> devices.
>
>
>Hi Jay, Sebastian,
>
>Sorry, the conditional alloc/free approach in bond_option_mode_set()
>I suggested earlier doesn't work well on closer inspection.
>
>Since bond_option_mode_set() requires the bond to be down, and, as this
>bug shows, the XDP redirect path can still reach a downed bond device,
>we need to carefully order the operations:
>
> - when switching to RR, allocate rr_tx_counter before setting the mode;
> - when switching away, set the mode before freeing rr_tx_counter.
>
>Strictly speaking, this also requires smp_wmb()/smp_rmb() pairs to
>guarantee the ordering is visible to other CPUs — the write side in
>bond_option_mode_set() and the read side in the XDP path.
>
>In practice the race window is very small and unlikely to trigger, but
>leaving out the barriers would be incorrect on its face, and adding them
>to the XDP hot path feels wrong just to save 4 bytes per CPU per bond
>device.

        Ok, fair enough.

        Can you repost the last version, and please note in the commit
log that we're deliberately allowing the memory to be allocated but
unused for most cases for the above reasons?  I'm also thinking that a
short comment in the code along the lines of "unused in most modes but
needed for XDP" is worthwhile.

        Assuming Jakub, Eric, and Paolo don't object to the patch, I
think we should have it documented that we're doing this on purpose so
nobody tries to "fix" it later.

        -J


>Here is the sketch with the smp_wmb()/smp_rmb() pair (untested):
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4279,12 +4279,6 @@ static int bond_open(struct net_device *bond_dev)
>       struct list_head *iter;
>       struct slave *slave;
>
>-     if (BOND_MODE(bond) == BOND_MODE_ROUNDROBIN && !bond->rr_tx_counter) {
>-             bond->rr_tx_counter = alloc_percpu(u32);
>-             if (!bond->rr_tx_counter)
>-                     return -ENOMEM;
>-     }
>-
>       /* reset slave->backup and slave->inactive */
>       if (bond_has_slaves(bond)) {
>                       bond_for_each_slave(bond, slave, iter) {
>@@ -5532,6 +5526,8 @@ bond_xdp_get_xmit_slave(struct net_device *bond_dev, 
>struct xdp_buff *xdp)
>
>       switch (BOND_MODE(bond)) {
>       case BOND_MODE_ROUNDROBIN:
>+             /* Pairs with smp_wmb() in bond_option_mode_set() */
>+             smp_rmb();
>                       slave = bond_xdp_xmit_roundrobin_slave_get(bond, xdp);
>                       break;
>
>@@ -6411,6 +6407,14 @@ static int bond_init(struct net_device *bond_dev)
>       if (!bond->wq)
>                       return -ENOMEM;
>
>+     /* Default mode is round-robin, allocate rr_tx_counter for it.
>+      * For mode changes, bond_option_mode_set() manages the lifecycle.
>+      */
>+     bond->rr_tx_counter = alloc_percpu(u32);
>+     if (!bond->rr_tx_counter) {
>+             destroy_workqueue(bond->wq);
>+             return -ENOMEM;
>+     }
>+
>       bond->notifier_ctx = false;
>
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -918,7 +918,27 @@ static int bond_option_mode_set(struct bonding *bond,
>       /* don't cache arp_validate between modes */
>       bond->params.arp_validate = BOND_ARP_VALIDATE_NONE;
>-     bond->params.mode = newval->value;
>+
>+     if (newval->value == BOND_MODE_ROUNDROBIN) {
>+             /* Switching to round-robin: allocate before setting mode,
>+              * so XDP path seeing BOND_MODE_ROUNDROBIN always finds
>+              * rr_tx_counter allocated.
>+              */
>+             if (!bond->rr_tx_counter) {
>+                     bond->rr_tx_counter = alloc_percpu(u32);
>+                     if (!bond->rr_tx_counter)
>+                             return -ENOMEM;
>+             }
>+             /* Pairs with smp_rmb() in bond_xdp_get_xmit_slave() */
>+             smp_wmb();
>+             bond->params.mode = newval->value;
>+     } else {
>+             /* Switching away: set mode first so XDP no longer
>+              * enters RR branch before we free rr_tx_counter.
>+              */
>+             bond->params.mode = newval->value;
>+             /* Pairs with smp_rmb() in bond_xdp_get_xmit_slave() */
>+             smp_wmb();
>+             free_percpu(bond->rr_tx_counter);
>+             bond->rr_tx_counter = NULL;
>+     }
>
>Thanks,
>Jiayuan
>
>
>> > 
>> > Thanks
>> >  
>> >  -J
>> > 
>> Sebastian
>>

---
        -Jay Vosburgh, [email protected]
