BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
useful: an XDP program receiving a VLAN-tagged frame on a physical
device wants the lookup to behave as if the packet had arrived on the
corresponding VLAN subinterface, so iif-based policy routing and VRF
table selection use the right ingress.

Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
The device must be up and in the same network namespace as
params->ifindex (a VLAN device can be moved to another netns while
registered on its parent; receive would deliver into that other
namespace, which a lookup here cannot represent). If params->ifindex
is itself a VLAN device, its inner (QinQ) subinterface is matched.
For a bond or team, a lookup with a port's ifindex matches no device
and returns NOT_FWDED; pass the master's ifindex instead. The lookup
then runs with the resolved device as the ingress; params->ifindex
itself is not modified on the input side. When the
resolved device is enslaved to a VRF, both the full lookup (via the
l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
select the VRF's table from the resolved ingress. That follows from
feeding the resolved device to the flow as the ingress
(fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
the VRF master from the subinterface rather than from
params->ifindex.
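
As a rough illustration of the caller side (a userspace sketch, not
part of this patch), the helper below fills the tag fields and picks
the flag. struct sketch_fib_params, the SKETCH_* constants and
sketch_prepare_vlan_input() are hypothetical stand-ins for
struct bpf_fib_lookup and the uapi definitions; only the constant
values are taken from this patch. A real XDP program would parse the
tag from the frame and pass the returned flags to bpf_fib_lookup().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirrors of the uapi values, for illustration only. */
#define SKETCH_VLAN_VID_MASK		0x0fff
#define SKETCH_FIB_LOOKUP_VLAN_INPUT	(1U << 7)

/* Stand-in for the struct bpf_fib_lookup fields the flag reads. */
struct sketch_fib_params {
	uint32_t ifindex;	/* device the frame arrived on */
	uint16_t h_vlan_proto;	/* network byte order */
	uint16_t h_vlan_TCI;	/* network byte order */
};

/* Fill params for an ingress lookup and return the flags to pass.
 * proto_be and tci_be are the outer tag as read from the frame.
 */
static uint32_t sketch_prepare_vlan_input(struct sketch_fib_params *p,
					  uint32_t rx_ifindex,
					  uint16_t proto_be, uint16_t tci_be)
{
	assert(p != NULL);

	p->ifindex = rx_ifindex;

	/* VID 0 is a priority tag: receive treats the frame as untagged,
	 * so look up on the device itself and leave the flag clear.
	 */
	if ((ntohs(tci_be) & SKETCH_VLAN_VID_MASK) == 0)
		return 0;

	p->h_vlan_proto = proto_be;
	p->h_vlan_TCI = tci_be;
	return SKETCH_FIB_LOOKUP_VLAN_INPUT;
}
```

The priority and DEI bits of the TCI are passed through untouched;
only the low 12 bits select the subinterface.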

The two failure classes get different treatment on purpose. An
h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
-EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
with a program-controlled value. An unmatched VID, a device that is
down, or one in another namespace is a data outcome and returns
BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
fib_get_table() finds no table and mirroring real ingress, where the
receive path drops such frames. A VID of 0 (a priority tag) is looked
up literally and normally fails the same way; receive instead
processes such frames untagged, so callers should not set the flag for
priority tags. Proceeding on the physical device for any of these
would be fail-open for the policy-routing cases above.
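
The split between the two classes can be modelled in userspace
roughly as follows. This is a sketch under stated assumptions, not the
kernel code: struct sketch_vlan_dev and the linear scan are
hypothetical replacements for the real VLAN group lookup, and the
namespace is reduced to an integer id. SKETCH_RET_NOT_FWDED carries
the numeric value of BPF_FIB_LKUP_RET_NOT_FWDED.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_P_8021Q	0x8100
#define SKETCH_ETH_P_8021AD	0x88a8
#define SKETCH_VLAN_VID_MASK	0x0fff
#define SKETCH_RET_NOT_FWDED	4	/* BPF_FIB_LKUP_RET_NOT_FWDED */

/* Hypothetical model of one VLAN subinterface on the parent device. */
struct sketch_vlan_dev {
	uint16_t proto;	/* host byte order */
	uint16_t vid;
	bool up;
	int netns;	/* opaque namespace id */
};

static int sketch_vlan_input_dev(const struct sketch_vlan_dev *devs, size_t n,
				 int parent_netns, uint16_t proto_be,
				 uint16_t tci_be,
				 const struct sketch_vlan_dev **out)
{
	uint16_t proto = ntohs(proto_be);
	uint16_t vid = ntohs(tci_be) & SKETCH_VLAN_VID_MASK;
	size_t i;

	assert(out != NULL);

	/* API misuse: not a VLAN ethertype. */
	if (proto != SKETCH_ETH_P_8021Q && proto != SKETCH_ETH_P_8021AD)
		return -EINVAL;

	for (i = 0; i < n; i++) {
		if (devs[i].proto != proto || devs[i].vid != vid)
			continue;
		/* Data outcome: matched, but down or in another netns. */
		if (!devs[i].up || devs[i].netns != parent_netns)
			return SKETCH_RET_NOT_FWDED;
		*out = &devs[i];
		return 0;
	}

	/* Data outcome: no subinterface configured for this tag. */
	return SKETCH_RET_NOT_FWDED;
}
```

A negative return is an error the program should treat as a bug in its
own arguments; a positive return is a verdict about the packet.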

The h_vlan fields share a union with tbid, so the flag cannot be
combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
return -EINVAL; restricting now keeps a later relaxation backward
compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
consumed on the ingress side and the egress tag is written on
success.
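
The flag rules above can be exercised outside the kernel with a small
model. The SKETCH_* names and sketch_flags_ok() are hypothetical
mirrors of the BPF_FIB_LOOKUP_* bits and of the check added by this
patch, and union sketch_tag_or_tbid only illustrates why the tag and
tbid cannot both be inputs; none of it is the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the BPF_FIB_LOOKUP_* bits, for illustration only. */
#define SKETCH_DIRECT		(1U << 0)
#define SKETCH_OUTPUT		(1U << 1)
#define SKETCH_SKIP_NEIGH	(1U << 2)
#define SKETCH_TBID		(1U << 3)
#define SKETCH_SRC		(1U << 4)
#define SKETCH_MARK		(1U << 5)
#define SKETCH_VLAN		(1U << 6)
#define SKETCH_VLAN_INPUT	(1U << 7)

#define SKETCH_MASK (SKETCH_DIRECT | SKETCH_OUTPUT | SKETCH_SKIP_NEIGH | \
		     SKETCH_TBID | SKETCH_SRC | SKETCH_MARK | \
		     SKETCH_VLAN | SKETCH_VLAN_INPUT)

/* The tag and tbid occupy the same four bytes, so only one of them can
 * be an input at a time.
 */
union sketch_tag_or_tbid {
	struct {
		uint16_t h_vlan_proto;
		uint16_t h_vlan_TCI;
	};
	uint32_t tbid;
};
static_assert(sizeof(union sketch_tag_or_tbid) == sizeof(uint32_t),
	      "tag fields alias tbid");

static bool sketch_flags_ok(uint32_t flags)
{
	/* Unknown bits are rejected, as before. */
	if (flags & ~SKETCH_MASK)
		return false;

	/* VLAN_INPUT excludes TBID (shared storage) and OUTPUT
	 * (egress perspective).
	 */
	if ((flags & SKETCH_VLAN_INPUT) &&
	    (flags & (SKETCH_TBID | SKETCH_OUTPUT)))
		return false;

	return true;
}
```

In this model VLAN_INPUT together with VLAN or DIRECT passes, while
VLAN_INPUT together with TBID or OUTPUT is refused, matching the
-EINVAL cases described above.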

Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
NULL, so every lookup with the flag returns NOT_FWDED, which is
correct since no VLAN device can exist.

Suggested-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Avinash Duduskar <[email protected]>
---
 include/uapi/linux/bpf.h       | 34 ++++++++++++++-
 net/core/filter.c              | 80 +++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h | 34 ++++++++++++++-
 3 files changed, 141 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f77aa9472bf1..57e28da3336a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3552,6 +3552,35 @@ union bpf_attr {
  *                     reports the route mtu in *params*->mtu_result, and on
  *                     the tc path without tot_len the mtu check runs after
  *                     the swap, against the parent device.
+ *             **BPF_FIB_LOOKUP_VLAN_INPUT**
+ *                     Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *                     as an input VLAN tag (e.g. parsed from the packet) and
+ *                     run the lookup as if ingress had happened on the VLAN
+ *                     subinterface carrying that tag for *params*->ifindex,
+ *                     rather than on *params*->ifindex itself. The VID is the
+ *                     low 12 bits of *params*->h_vlan_TCI;
+ *                     *params*->h_vlan_proto must be ETH_P_8021Q or
+ *                     ETH_P_8021AD in network byte order (any other value
+ *                     returns **-EINVAL**). The
+ *                     subinterface is the one configured for that tag on
+ *                     *params*->ifindex; if *params*->ifindex is itself a
+ *                     VLAN device, its inner (QinQ) subinterface is matched.
+ *                     For a bond or team, a tag on a port matches no
+ *                     device and returns NOT_FWDED; pass the master's
+ *                     ifindex.
+ *                     If no matching subinterface exists, or it is not up,
+ *                     or it was moved to another network namespace, the
+ *                     lookup returns **BPF_FIB_LKUP_RET_NOT_FWDED**,
+ *                     mirroring real ingress, which drops a frame whose tag
+ *                     is unconfigured or whose VLAN device is down. A VID of
+ *                     0 (a priority-tagged frame) is looked up literally like
+ *                     any other VID; receive instead processes such frames
+ *                     untagged on the device itself, so do not set this flag
+ *                     for priority tags.
+ *                     Cannot be combined with **BPF_FIB_LOOKUP_TBID** (both
+ *                     use the same input fields) or **BPF_FIB_LOOKUP_OUTPUT**
+ *                     (this flag is ingress-only); doing so returns
+ *                     **-EINVAL**.
  *
  *             *ctx* is either **struct xdp_md** for XDP programs or
  *             **struct sk_buff** tc cls_act programs.
@@ -7348,6 +7377,7 @@ enum {
        BPF_FIB_LOOKUP_SRC     = (1U << 4),
        BPF_FIB_LOOKUP_MARK    = (1U << 5),
        BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+       BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7416,7 +7446,9 @@ struct bpf_fib_lookup {
                struct {
                        /* output with BPF_FIB_LOOKUP_VLAN: set from the
                         * resolved egress VLAN device (see the flag); zeroed
-                        * on other successful lookups.
+                        * on other successful lookups. input with
+                        * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+                        * the lookup by.
                         */
                        __be16  h_vlan_proto;
                        __be16  h_vlan_TCI;
diff --git a/net/core/filter.c b/net/core/filter.c
index b37a12321fba..cfbdd842ce61 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6158,6 +6158,41 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
 
        return 0;
 }
+
+/* With BPF_FIB_LOOKUP_VLAN_INPUT the caller passes the packet's VLAN tag in
+ * params->h_vlan_proto and params->h_vlan_TCI; the lookup is done as if
+ * ingress had happened on the matching VLAN subinterface of *dev. Resolve
+ * it and store it in *dev. params is not modified.
+ *
+ * A protocol other than 802.1Q/802.1AD is API misuse (it would otherwise
+ * reach the WARN in vlan_proto_idx()), so it is rejected with -EINVAL. An
+ * unmatched VID, a matching device that is down, or one that was moved
+ * to another netns (receive would deliver into that netns' stack, which
+ * a lookup here cannot represent) is a data outcome, reported as
+ * NOT_FWDED, the same way the DIRECT path reports a missing table. Under
+ * !CONFIG_VLAN_8021Q __vlan_find_dev_deep_rcu() returns NULL, so every
+ * call returns NOT_FWDED, which is correct since no subinterface can
+ * exist.
+ */
+static int bpf_fib_vlan_input_dev(struct net_device **dev,
+                                 const struct bpf_fib_lookup *params)
+{
+       __be16 proto = params->h_vlan_proto;
+       struct net_device *vlan_dev;
+       u16 vid;
+
+       if (proto != htons(ETH_P_8021Q) && proto != htons(ETH_P_8021AD))
+               return -EINVAL;
+
+       vid = ntohs(params->h_vlan_TCI) & VLAN_VID_MASK;
+       vlan_dev = __vlan_find_dev_deep_rcu(*dev, proto, vid);
+       if (!vlan_dev || !(vlan_dev->flags & IFF_UP) ||
+           !net_eq(dev_net(vlan_dev), dev_net(*dev)))
+               return BPF_FIB_LKUP_RET_NOT_FWDED;
+
+       *dev = vlan_dev;
+       return 0;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_INET)
@@ -6177,6 +6212,12 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
        if (unlikely(!dev))
                return -ENODEV;
 
+       if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+               err = bpf_fib_vlan_input_dev(&dev, params);
+               if (err)
+                       return err;
+       }
+
        /* verify forwarding is enabled on this interface */
        in_dev = __in_dev_get_rcu(dev);
        if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
@@ -6186,7 +6227,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
                fl4.flowi4_iif = 1;
                fl4.flowi4_oif = params->ifindex;
        } else {
-               fl4.flowi4_iif = params->ifindex;
+               /* dev->ifindex, not params->ifindex: VLAN_INPUT may have
+                * resolved dev to a subinterface above.
+                */
+               fl4.flowi4_iif = dev->ifindex;
                fl4.flowi4_oif = 0;
        }
        fl4.flowi4_dscp = inet_dsfield_to_dscp(params->tos);
@@ -6323,6 +6367,12 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
        if (unlikely(!dev))
                return -ENODEV;
 
+       if (flags & BPF_FIB_LOOKUP_VLAN_INPUT) {
+               err = bpf_fib_vlan_input_dev(&dev, params);
+               if (err)
+                       return err;
+       }
+
        idev = __in6_dev_get_safely(dev);
        if (unlikely(!idev || !READ_ONCE(idev->cnf.forwarding)))
                return BPF_FIB_LKUP_RET_FWD_DISABLED;
@@ -6331,7 +6381,11 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
                fl6.flowi6_iif = 1;
                oif = fl6.flowi6_oif = params->ifindex;
        } else {
-               oif = fl6.flowi6_iif = params->ifindex;
+               /* dev->ifindex, not params->ifindex: VLAN_INPUT may have
+                * resolved dev to a subinterface above.
+                */
+               oif = dev->ifindex;
+               fl6.flowi6_iif = oif;
                fl6.flowi6_oif = 0;
                strict = RT6_LOOKUP_F_HAS_SADDR;
        }
@@ -6443,7 +6497,23 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 #define BPF_FIB_LOOKUP_MASK (BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT | \
                             BPF_FIB_LOOKUP_SKIP_NEIGH | BPF_FIB_LOOKUP_TBID | \
                             BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_MARK | \
-                            BPF_FIB_LOOKUP_VLAN)
+                            BPF_FIB_LOOKUP_VLAN | BPF_FIB_LOOKUP_VLAN_INPUT)
+
+static bool bpf_fib_lookup_flags_ok(u32 flags)
+{
+       if (flags & ~BPF_FIB_LOOKUP_MASK)
+               return false;
+
+       /* VLAN_INPUT reads h_vlan_proto/h_vlan_TCI, which alias tbid, so it
+        * cannot be combined with TBID. It is also ingress-only, so it
+        * cannot be combined with the egress-perspective OUTPUT flag.
+        */
+       if ((flags & BPF_FIB_LOOKUP_VLAN_INPUT) &&
+           (flags & (BPF_FIB_LOOKUP_TBID | BPF_FIB_LOOKUP_OUTPUT)))
+               return false;
+
+       return true;
+}
 
 BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
           struct bpf_fib_lookup *, params, int, plen, u32, flags)
@@ -6451,7 +6521,7 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
        if (plen < sizeof(*params))
                return -EINVAL;
 
-       if (flags & ~BPF_FIB_LOOKUP_MASK)
+       if (!bpf_fib_lookup_flags_ok(flags))
                return -EINVAL;
 
        switch (params->family) {
@@ -6489,7 +6559,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
        if (plen < sizeof(*params))
                return -EINVAL;
 
-       if (flags & ~BPF_FIB_LOOKUP_MASK)
+       if (!bpf_fib_lookup_flags_ok(flags))
                return -EINVAL;
 
        if (params->tot_len)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f77aa9472bf1..57e28da3336a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3552,6 +3552,35 @@ union bpf_attr {
  *                     reports the route mtu in *params*->mtu_result, and on
  *                     the tc path without tot_len the mtu check runs after
  *                     the swap, against the parent device.
+ *             **BPF_FIB_LOOKUP_VLAN_INPUT**
+ *                     Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
+ *                     as an input VLAN tag (e.g. parsed from the packet) and
+ *                     run the lookup as if ingress had happened on the VLAN
+ *                     subinterface carrying that tag for *params*->ifindex,
+ *                     rather than on *params*->ifindex itself. The VID is the
+ *                     low 12 bits of *params*->h_vlan_TCI;
+ *                     *params*->h_vlan_proto must be ETH_P_8021Q or
+ *                     ETH_P_8021AD in network byte order (any other value
+ *                     returns **-EINVAL**). The
+ *                     subinterface is the one configured for that tag on
+ *                     *params*->ifindex; if *params*->ifindex is itself a
+ *                     VLAN device, its inner (QinQ) subinterface is matched.
+ *                     For a bond or team, a tag on a port matches no
+ *                     device and returns NOT_FWDED; pass the master's
+ *                     ifindex.
+ *                     If no matching subinterface exists, or it is not up,
+ *                     or it was moved to another network namespace, the
+ *                     lookup returns **BPF_FIB_LKUP_RET_NOT_FWDED**,
+ *                     mirroring real ingress, which drops a frame whose tag
+ *                     is unconfigured or whose VLAN device is down. A VID of
+ *                     0 (a priority-tagged frame) is looked up literally like
+ *                     any other VID; receive instead processes such frames
+ *                     untagged on the device itself, so do not set this flag
+ *                     for priority tags.
+ *                     Cannot be combined with **BPF_FIB_LOOKUP_TBID** (both
+ *                     use the same input fields) or **BPF_FIB_LOOKUP_OUTPUT**
+ *                     (this flag is ingress-only); doing so returns
+ *                     **-EINVAL**.
  *
  *             *ctx* is either **struct xdp_md** for XDP programs or
  *             **struct sk_buff** tc cls_act programs.
@@ -7348,6 +7377,7 @@ enum {
        BPF_FIB_LOOKUP_SRC     = (1U << 4),
        BPF_FIB_LOOKUP_MARK    = (1U << 5),
        BPF_FIB_LOOKUP_VLAN    = (1U << 6),
+       BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
 };
 
 enum {
@@ -7416,7 +7446,9 @@ struct bpf_fib_lookup {
                struct {
                        /* output with BPF_FIB_LOOKUP_VLAN: set from the
                         * resolved egress VLAN device (see the flag); zeroed
-                        * on other successful lookups.
+                        * on other successful lookups. input with
+                        * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
+                        * the lookup by.
                         */
                        __be16  h_vlan_proto;
                        __be16  h_vlan_TCI;
-- 
2.54.0

