On Sat, Jun 7, 2025 at 10:47 PM Jakub Kicinski <k...@kernel.org> wrote: > > A not-so-careful NAT46 BPF program can crash the kernel > if it indiscriminately flips ingress packets from v4 to v6: > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > ip6_rcv_core (net/ipv6/ip6_input.c:190:20) > ipv6_rcv (net/ipv6/ip6_input.c:306:8) > process_backlog (net/core/dev.c:6186:4) > napi_poll (net/core/dev.c:6906:9) > net_rx_action (net/core/dev.c:7028:13) > do_softirq (kernel/softirq.c:462:3) > netif_rx (net/core/dev.c:5326:3) > dev_loopback_xmit (net/core/dev.c:4015:2) > ip_mc_finish_output (net/ipv4/ip_output.c:363:8) > NF_HOOK (./include/linux/netfilter.h:314:9) > ip_mc_output (net/ipv4/ip_output.c:400:5) > dst_output (./include/net/dst.h:459:9) > ip_local_out (net/ipv4/ip_output.c:130:9) > ip_send_skb (net/ipv4/ip_output.c:1496:8) > udp_send_skb (net/ipv4/udp.c:1040:8) > udp_sendmsg (net/ipv4/udp.c:1328:10) > > The output interface has a 4->6 program attached at ingress. > We try to loop the multicast skb back to the sending socket. > Ingress BPF runs as part of netif_rx(), pushes a valid v6 hdr > and changes skb->protocol to v6. We enter ip6_rcv_core which > tries to use skb_dst(). But the dst is still an IPv4 one left > after IPv4 mcast output. > > Clear the dst in all BPF helpers which change the protocol. > Also clear the dst if we did an encap or decap as those > will most likely make the dst stale. > Try to preserve metadata dsts, those may carry non-routing > metadata. > > Reviewed-by: Maciej Żenczykowski <m...@google.com> > Acked-by: Daniel Borkmann <dan...@iogearbox.net> > Fixes: d219df60a70e ("bpf: Add ipip6 and ip6ip decap support for > bpf_skb_adjust_room()") > Fixes: 1b00e0dfe7d0 ("bpf: update skb->protocol in bpf_skb_net_grow") > Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") > Signed-off-by: Jakub Kicinski <k...@kernel.org> > --- > v2: > - drop on encap/decap > - fix typo (protcol) > - add the test to the Makefile > v1: https://lore.kernel.org/20250604210604.257036-1-k...@kernel.org > > I wonder if we should not skip ingress (tc_skip_classify?) > for looped back packets in the first place. But that doesn't > seem robust enough vs multiple redirections to solve the crash. > > Ignoring LOOPBACK packets (like the NAT46 prog should) doesn't > work either, since BPF can change pkt_type arbitrarily. > > CC: martin....@linux.dev > CC: dan...@iogearbox.net > CC: john.fastab...@gmail.com > CC: eddy...@gmail.com > CC: s...@fomichev.me > CC: hao...@google.com > CC: will...@google.com > CC: william.xuanziy...@huawei.com > CC: alan.magu...@oracle.com > CC: b...@vger.kernel.org > CC: eduma...@google.com > CC: m...@google.com > CC: sh...@kernel.org > CC: linux-kselftest@vger.kernel.org > CC: yonghong.s...@linux.dev > --- > tools/testing/selftests/net/Makefile | 1 + > net/core/filter.c | 31 +++++++++++++++++++------- > tools/testing/selftests/net/nat6to4.sh | 15 +++++++++++++ > 3 files changed, 39 insertions(+), 8 deletions(-) > create mode 100755 tools/testing/selftests/net/nat6to4.sh > > diff --git a/tools/testing/selftests/net/Makefile > b/tools/testing/selftests/net/Makefile > index ea84b88bcb30..ab996bd22a5f 100644 > --- a/tools/testing/selftests/net/Makefile > +++ b/tools/testing/selftests/net/Makefile > @@ -27,6 +27,7 @@ TEST_PROGS += amt.sh > TEST_PROGS += unicast_extensions.sh > TEST_PROGS += udpgro_fwd.sh > TEST_PROGS += udpgro_frglist.sh > +TEST_PROGS += nat6to4.sh > TEST_PROGS += veth.sh > TEST_PROGS += ioam6.sh > TEST_PROGS += gro.sh > diff --git a/net/core/filter.c b/net/core/filter.c > index 327ca73f9cd7..d5917d6446f2 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -3406,8 +3406,14 @@ BPF_CALL_3(bpf_skb_change_proto, struct sk_buff *, > skb, __be16, proto, > * need to be verified first. > */ > ret = bpf_skb_proto_xlat(skb, proto); > + if (ret) > + return ret; > + > bpf_compute_data_pointers(skb); > - return ret; > + if (skb_valid_dst(skb)) > + skb_dst_drop(skb); > + > + return 0; > } > > static const struct bpf_func_proto bpf_skb_change_proto_proto = { > @@ -3554,6 +3560,9 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 > off, u32 len_diff, > else if (skb->protocol == htons(ETH_P_IPV6) && > flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV4) > skb->protocol = htons(ETH_P_IP); > + > + if (skb_valid_dst(skb)) > + skb_dst_drop(skb); > } > > if (skb_is_gso(skb)) { > @@ -3581,6 +3590,7 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 > off, u32 len_diff, > static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, > u64 flags) > { > + bool decap = flags & BPF_F_ADJ_ROOM_DECAP_L3_MASK; > int ret; > > if (unlikely(flags & ~(BPF_F_ADJ_ROOM_FIXED_GSO | > @@ -3603,13 +3613,18 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, > u32 off, u32 len_diff, > if (unlikely(ret < 0)) > return ret; > > - /* Match skb->protocol to new outer l3 protocol */ > - if (skb->protocol == htons(ETH_P_IP) && > - flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV6) > - skb->protocol = htons(ETH_P_IPV6); > - else if (skb->protocol == htons(ETH_P_IPV6) && > - flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV4) > - skb->protocol = htons(ETH_P_IP); > + if (decap) { > + /* Match skb->protocol to new outer l3 protocol */ > + if (skb->protocol == htons(ETH_P_IP) && > + flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV6) > + skb->protocol = htons(ETH_P_IPV6); > + else if (skb->protocol == htons(ETH_P_IPV6) && > + flags & BPF_F_ADJ_ROOM_DECAP_L3_IPV4) > + skb->protocol = htons(ETH_P_IP); > + > + if (skb_valid_dst(skb)) > + skb_dst_drop(skb); > + } > > if (skb_is_gso(skb)) { > struct skb_shared_info *shinfo = skb_shinfo(skb); > diff --git a/tools/testing/selftests/net/nat6to4.sh > b/tools/testing/selftests/net/nat6to4.sh > new file mode 100755 > index 000000000000..0ee859b622a4 > --- /dev/null > +++ b/tools/testing/selftests/net/nat6to4.sh > @@ -0,0 +1,15 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-2.0 > + > +NS="ns-peer-$(mktemp -u XXXXXX)" > + > +ip netns add "${NS}" > +ip -netns "${NS}" link set lo up > +ip -netns "${NS}" route add default via 127.0.0.2 dev lo > + > +tc -n "${NS}" qdisc add dev lo ingress > +tc -n "${NS}" filter add dev lo ingress prio 4 protocol ip \ > + bpf object-file nat6to4.bpf.o section schedcls/egress4/snat4 direct-action > + > +ip netns exec "${NS}" \ > + bash -c 'echo > 012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789abc > | socat - UDP4-DATAGRAM:224.1.0.1:6666,ip-multicast-loop=1' > -- > 2.49.0
Submit away. 1 meta question: as this is a fix and will thus be backported into 5.4+ LTS, should this be split into two patches? Either making the test a follow up, or even going with only the crash fix in patch 1 and putting the 4-in-4 and 6-in-6 behavioural change in patch 2? We'd end up in the same state at tip of tree... but it would affect the LTS backports. Honestly I'm not even sure what's best.