On 6/10/26 6:28 PM, Sun Jian wrote:
Generic XDP devmap multi redirect uses skb_clone() for the
intermediate destinations and sends the last destination with the
original skb. This can leave multiple destinations sharing the same
packet data.
This becomes visible when a devmap egress program mutates packet data.
One destination can observe changes made for another destination. The
last-destination path has the same problem: the last destination runs on
the original skb, so its egress program can modify packet data still
shared with earlier cloned skbs.
Native XDP broadcast redirect does not have this issue because
xdpf_clone() copies the frame data for each destination. Generic XDP
should provide the same per-destination isolation before running a
devmap egress program.
Fix this by making cloned skbs private in dev_map_generic_redirect()
before running the devmap egress program. Use skb_copy() instead of
skb_unshare() so that allocation failure does not consume the skb and
the existing caller error paths keep their ownership semantics.
Add a selftest that covers the last-destination case where earlier
destinations do not have a devmap egress program, while the final
destination does.
Tested with:
./test_progs -t xdp_veth_egress
./test_progs -t xdp_veth
./test_progs -t xdp
Fixes: e624d4ed4aa8 ("xdp: Extend xdp_redirect_map with broadcast support")
Suggested-by: Jiayuan Chen <[email protected]>
Signed-off-by: Sun Jian <[email protected]>
---
v1:
https://lore.kernel.org/bpf/CABFUUZFimdrZdq=NWi+N-0sJZWvMwY=f4iF6-3TVMS8=m07...@mail.gmail.com/
Changes in v2:
- Move the private-copy step into dev_map_generic_redirect() so the
last-destination path is covered as well.
- Use skb_copy() instead of skb_unshare() to keep caller ownership
unchanged on allocation failure.
- Add a generic XDP last-destination selftest case.
kernel/bpf/devmap.c | 10 ++
.../selftests/bpf/prog_tests/test_xdp_veth.c | 151 +++++++++++++++++-
2 files changed, 158 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index cc0a43ebab6b..59f267685bc6 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -700,12 +700,22 @@ int dev_map_enqueue_multi(struct xdp_frame *xdpf, struct
net_device *dev_rx,
int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
const struct bpf_prog *xdp_prog)
{
+ struct sk_buff *nskb;
int err;
err = xdp_ok_fwd_dev(dst->dev, skb->len);
if (unlikely(err))
return err;
+ if (dst->xdp_prog && skb_cloned(skb)) {
+ nskb = skb_copy(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ consume_skb(skb);
+ skb = nskb;
+ }
+
LGTM.
Please split selftest as a seprated patch of a patchset.
/* Redirect has already succeeded semantically at this point, so we just
* return 0 even if packet is dropped. Helper below takes care of
* freeing skb.
diff --git a/tools/testing/selftests/bpf/prog_tests/test_xdp_veth.c
b/tools/testing/selftests/bpf/prog_tests/test_xdp_veth.c
index 3e98a1665936..1f0b9ade12fe 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xdp_veth.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xdp_veth.c
@@ -456,7 +456,11 @@ static void xdp_veth_egress(u32 flags)
.remote_flags = flags,
[...]
+ for (i = 0; i < VETH_PAIRS_COUNT; i++) {
+ struct bpf_devmap_val devmap_val = {};
+ int ifindex = if_nametoindex(net_config.veth_cfg[i].local_veth);
+
+ SYS(destroy_xdp_redirect_map,
+ "ip -n %s neigh add %s lladdr 00:00:00:00:00:01 dev %s",
+ net_config.veth_cfg[i].namespace, IP_NEIGH,
+ net_config.veth_cfg[i].remote_veth);
+
+ if (attach_programs_to_veth_pair(bpf_objs, VETH_EGRESS_SKEL_NB,
+ &net_config, prog_cfg, i))
+ goto destroy_xdp_redirect_map;
+
+ err = bpf_map_update_elem(mac_map, &ifindex, egress_macs[i], 0);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto destroy_xdp_redirect_map;
+
+ devmap_val.ifindex = ifindex;
+ devmap_val.bpf_prog.fd = -1;
+
+ if (i == VETH_PAIRS_COUNT - 1)
+ devmap_val.bpf_prog.fd =
+
bpf_program__fd(xdp_redirect_multi_kern->progs.xdp_devmap_prog);
+
+ err = bpf_map_update_elem(egress_map, &ifindex, &devmap_val, 0);
The kernel walks DEVMAP_HASH in bucket order so I don't think we can
guarantee that the entry at VETH_PAIRS_COUNT - 1 is the last dst.
Could we use i as the key instead, so the entry order becomes
deterministic? (bucket is key & (n_buckets - 1).)