Hi Toke,

On 9/10/24 10:34, Toke Høiland-Jørgensen wrote:
> Florian Kauer <florian.ka...@linutronix.de> writes:
> 
>> Currently, returning XDP_REDIRECT from a xdp/devmap program
>> is considered as invalid action and an exception is traced.
>>
>> While it might seem counterintuitive to redirect in a xdp/devmap
>> program (why not just redirect to the correct interface in the
>> first program?), we faced several use cases where supporting
>> this would be very useful.
>>
>> Most importantly, they occur when the first redirect is used
>> with the BPF_F_BROADCAST flag. Using this together with xdp/devmap
>> programs makes it possible to perform different actions on clones of
>> the same incoming frame. In that case, it is often useful
>> to redirect either to a different CPU or device AFTER the cloning.
>>
>> For example:
>> - Replicate the frame (for redundancy according to IEEE 802.1CB FRER)
>>   and then use the second redirect with a cpumap to select
>>   the path-specific egress queue.
>>
>> - Also, one of the paths might need an encapsulation that
>>   exceeds the MTU. So a second redirect can be used back
>>   to the ingress interface to send an ICMP FRAG_NEEDED packet.
>>
>> - For OAM purposes, you might want to send one frame with
>> OAM information back, while the original frame is passed forward.
>>
>> To enable these use cases, add the XDP_REDIRECT case to
>> dev_map_bpf_prog_run. Also, when this is called from inside
>> xdp_do_flush, the redirect might add further entries to the
>> flush lists that are currently being processed. Therefore, loop inside
>> xdp_do_flush until no additional redirects were added.
>>
>> Signed-off-by: Florian Kauer <florian.ka...@linutronix.de>
> 
> This is an interesting use case! However, your implementation makes it
> way too easy to end up in a situation that loops forever, so I think we
> should add some protection against that. Some details below:
> 
>> ---
>>  include/linux/bpf.h        |  4 ++--
>>  include/trace/events/xdp.h | 10 ++++++----
>>  kernel/bpf/devmap.c        | 37 +++++++++++++++++++++++++++--------
>>  net/core/filter.c          | 48 +++++++++++++++++++++++++++-------------------
>>  4 files changed, 65 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 3b94ec161e8c..1b57cbabf199 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -2498,7 +2498,7 @@ struct sk_buff;
>>  struct bpf_dtab_netdev;
>>  struct bpf_cpu_map_entry;
>>  
>> -void __dev_flush(struct list_head *flush_list);
>> +void __dev_flush(struct list_head *flush_list, int *redirects);
>>  int dev_xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>>                  struct net_device *dev_rx);
>>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_frame *xdpf,
>> @@ -2740,7 +2740,7 @@ static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
>>      return ERR_PTR(-EOPNOTSUPP);
>>  }
>>  
>> -static inline void __dev_flush(struct list_head *flush_list)
>> +static inline void __dev_flush(struct list_head *flush_list, int *redirects)
>>  {
>>  }
>>  
>> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
>> index a7e5452b5d21..fba2c457e727 100644
>> --- a/include/trace/events/xdp.h
>> +++ b/include/trace/events/xdp.h
>> @@ -269,9 +269,9 @@ TRACE_EVENT(xdp_devmap_xmit,
>>  
>>      TP_PROTO(const struct net_device *from_dev,
>>               const struct net_device *to_dev,
>> -             int sent, int drops, int err),
>> +             int sent, int drops, int redirects, int err),
>>  
>> -    TP_ARGS(from_dev, to_dev, sent, drops, err),
>> +    TP_ARGS(from_dev, to_dev, sent, drops, redirects, err),
>>  
>>      TP_STRUCT__entry(
>>              __field(int, from_ifindex)
>> @@ -279,6 +279,7 @@ TRACE_EVENT(xdp_devmap_xmit,
>>              __field(int, to_ifindex)
>>              __field(int, drops)
>>              __field(int, sent)
>> +            __field(int, redirects)
>>              __field(int, err)
>>      ),
>>  
>> @@ -288,16 +289,17 @@ TRACE_EVENT(xdp_devmap_xmit,
>>              __entry->to_ifindex     = to_dev->ifindex;
>>              __entry->drops          = drops;
>>              __entry->sent           = sent;
>> +            __entry->redirects      = redirects;
>>              __entry->err            = err;
>>      ),
>>  
>>      TP_printk("ndo_xdp_xmit"
>>                " from_ifindex=%d to_ifindex=%d action=%s"
>> -              " sent=%d drops=%d"
>> +              " sent=%d drops=%d redirects=%d"
>>                " err=%d",
>>                __entry->from_ifindex, __entry->to_ifindex,
>>                __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
>> -              __entry->sent, __entry->drops,
>> +              __entry->sent, __entry->drops, __entry->redirects,
>>                __entry->err)
>>  );
>>  
>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> index 7878be18e9d2..89bdec49ea40 100644
>> --- a/kernel/bpf/devmap.c
>> +++ b/kernel/bpf/devmap.c
>> @@ -334,7 +334,8 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key,
>>  static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>                              struct xdp_frame **frames, int n,
>>                              struct net_device *tx_dev,
>> -                            struct net_device *rx_dev)
>> +                            struct net_device *rx_dev,
>> +                            int *redirects)
>>  {
>>      struct xdp_txq_info txq = { .dev = tx_dev };
>>      struct xdp_rxq_info rxq = { .dev = rx_dev };
>> @@ -359,6 +360,13 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>                      else
>>                              frames[nframes++] = xdpf;
>>                      break;
>> +            case XDP_REDIRECT:
>> +                    err = xdp_do_redirect(rx_dev, &xdp, xdp_prog);
>> +                    if (unlikely(err))
>> +                            xdp_return_frame_rx_napi(xdpf);
>> +                    else
>> +                            *redirects += 1;
>> +                    break;
> 
> It's a bit subtle, but dev_map_bpf_prog_run() also filters the list of
> frames in the queue in-place (the frames[nframes++] = xdpf; line above),
> which only works under the assumption that the array in bq->q is not
> modified while this loop is being run. But now you're adding a call in
> the middle that may result in the packet being put back on the same
> queue in the middle, which means that this assumption no longer holds.
> 
> So you need to clear the bq->q queue first for this to work.
> Specifically, at the start of bq_xmit_all(), you'll need to first copy
> all the packet pointers onto an on-stack array, then run the rest of the
> function on that array. There's already an initial loop that goes
> through all the frames, so you can just do it there.
> 
> So the loop at the start of bq_xmit_all() goes from the current:
> 
>       for (i = 0; i < cnt; i++) {
>               struct xdp_frame *xdpf = bq->q[i];
> 
>               prefetch(xdpf);
>       }
> 
> 
> to something like:
> 
>         struct xdp_frame *frames[DEV_MAP_BULK_SIZE];
> 
>       for (i = 0; i < cnt; i++) {
>               struct xdp_frame *xdpf = bq->q[i];
> 
>               prefetch(xdpf);
>                 frames[i] = xdpf;
>       }
> 
>         bq->count = 0; /* bq is now empty, use the 'frames' and 'cnt'
>                           stack variables for the rest of the function */
> 
> 
> 
>>              default:
>>                      bpf_warn_invalid_xdp_action(NULL, xdp_prog, act);
>>                      fallthrough;
>> @@ -373,7 +381,7 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>      return nframes; /* sent frames count */
>>  }
>>  
>> -static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>> +static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags, int *redirects)
>>  {
>>      struct net_device *dev = bq->dev;
>>      unsigned int cnt = bq->count;
>> @@ -390,8 +398,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>              prefetch(xdpf);
>>      }
>>  
>> +    int new_redirects = 0;
>>      if (bq->xdp_prog) {
>> -            to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx);
>> +            to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx,
>> +                            &new_redirects);
>>              if (!to_send)
>>                      goto out;
>>      }
>> @@ -413,19 +423,21 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>  
>>  out:
>>      bq->count = 0;
>> -    trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err);
>> +    *redirects += new_redirects;
>> +    trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent - new_redirects,
>> +                    new_redirects, err);
>>  }
>>  
>>  /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the
>>   * driver before returning from its napi->poll() routine. See the comment above
>>   * xdp_do_flush() in filter.c.
>>   */
>> -void __dev_flush(struct list_head *flush_list)
>> +void __dev_flush(struct list_head *flush_list, int *redirects)
>>  {
>>      struct xdp_dev_bulk_queue *bq, *tmp;
>>  
>>      list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
>> -            bq_xmit_all(bq, XDP_XMIT_FLUSH);
>> +            bq_xmit_all(bq, XDP_XMIT_FLUSH, redirects);
>>              bq->dev_rx = NULL;
>>              bq->xdp_prog = NULL;
>>              __list_del_clearprev(&bq->flush_node);
>> @@ -458,8 +470,17 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>>  {
>>      struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>>  
>> -    if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
>> -            bq_xmit_all(bq, 0);
>> +    if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) {
>> +            int redirects = 0;
>> +
>> +            bq_xmit_all(bq, 0, &redirects);
>> +
>> +            /* according to comment above xdp_do_flush() in
>> +             * filter.c, xdp_do_flush is always called at
>> +             * the end of the NAPI anyway, so no need to act
>> +             * on the redirects here
>> +             */
> 
> While it's true that it will be called again in NAPI, the purpose of
> calling bq_xmit_all() here is to make room for the packet on the
> bulk queue that we're about to enqueue, and if bq_xmit_all() can just
> put the packet back on the queue, there is no guarantee that this will
> succeed. So you will have to handle that case here.
> 
> Since there's also a potential infinite recursion issue in the
> do_flush() functions below, I think it may be better to handle this
> looping issue inside bq_xmit_all().
> 
> I.e., structure the code so that bq_xmit_all() guarantees that when it
> returns it has actually done its job; that is, that bq->q is empty.
> 
> Given the above "move all frames out of bq->q at the start" change, this
> is not all that hard. Simply add a check after the out: label (in
> bq_xmit_all()) to check if bq->count is actually 0, and if it isn't,
> start over from the beginning of that function. This also makes it
> straightforward to add a recursion limit; after looping a set number of
> times (say, XMIT_RECURSION_LIMIT), simply turn XDP_REDIRECT into drops.
> 
> There will need to be some additional protection against looping forever
> in __dev_flush(), to handle the case where a packet is looped between
> two interfaces. This one is a bit trickier, but a similar recursion
> counter could be used, I think.


Thanks a lot for the extensive support!
Regarding __dev_flush(), could the following already be sufficient?

void __dev_flush(struct list_head *flush_list)
{
        struct xdp_dev_bulk_queue *bq, *tmp;
        int i, j;

        for (i = 0; !list_empty(flush_list) && i < XMIT_RECURSION_LIMIT; i++) {
                /* go through list in reverse so that new items
                 * added to the flush_list will only be handled
                 * in the next iteration of the outer loop
                 */
                list_for_each_entry_safe_reverse(bq, tmp, flush_list, flush_node) {
                        bq_xmit_all(bq, XDP_XMIT_FLUSH);
                        bq->dev_rx = NULL;
                        bq->xdp_prog = NULL;
                        __list_del_clearprev(&bq->flush_node);
                }
        }

        if (i == XMIT_RECURSION_LIMIT) {
                pr_warn("XDP recursion limit hit, expect packet loss!\n");

                list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
                        for (j = 0; j < bq->count; j++)
                                xdp_return_frame_rx_napi(bq->q[j]);

                        bq->count = 0;
                        bq->dev_rx = NULL;
                        bq->xdp_prog = NULL;
                        __list_del_clearprev(&bq->flush_node);
                }
        }
}
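
And inside bq_xmit_all() itself, I read your suggestion roughly as follows
(very rough sketch, untested; XMIT_RECURSION_LIMIT is just a placeholder,
and the xmit/error handling in the middle stays as it is today, only
operating on the on-stack 'frames' array):

static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
{
        struct xdp_frame *frames[DEV_MAP_BULK_SIZE];
        struct net_device *dev = bq->dev;
        int recursion = 0;
        int sent = 0, err = 0;
        unsigned int cnt;
        int i;

again:
        cnt = bq->count;

        /* move everything off bq->q so that a nested redirect from the
         * devmap program can safely re-enqueue to this same queue
         */
        for (i = 0; i < cnt; i++) {
                struct xdp_frame *xdpf = bq->q[i];

                prefetch(xdpf);
                frames[i] = xdpf;
        }
        bq->count = 0;

        /* ... run bq->xdp_prog over 'frames' and call ndo_xdp_xmit()
         * exactly as today, just on the on-stack array ...
         */

        trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err);

        /* a nested XDP_REDIRECT may have refilled bq->q in the meantime */
        if (bq->count) {
                if (++recursion >= XMIT_RECURSION_LIMIT) {
                        /* give up and drop instead of looping forever */
                        for (i = 0; i < bq->count; i++)
                                xdp_return_frame_rx_napi(bq->q[i]);
                        bq->count = 0;
                        return;
                }
                goto again;
        }
}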


> 
> In any case, this needs extensive selftests, including ones with devmap
> programs that loop packets (both redirect from a->a, and from
> a->b->a->b) to make sure the limits work correctly.


Good point! I am going to prepare some.
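
For the a->a case, I would start from a minimal devmap program along these
lines (just a sketch; the map layout, names and section are placeholders,
and the userspace part of the selftest would point the single map entry
back at the ingress ifindex):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);
} loop_map SEC(".maps");

SEC("xdp/devmap")
int xdp_devmap_loop(struct xdp_md *ctx)
{
        /* redirect back through the same map, so the frame keeps
         * looping until the recursion limit turns it into a drop
         */
        return bpf_redirect_map(&loop_map, 0, 0);
}

char _license[] SEC("license") = "GPL";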


> 
>> +    }
>>  
>>      /* Ingress dev_rx will be the same for all xdp_frame's in
>>       * bulk_queue, because bq stored per-CPU and must be flushed
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 8569cd2482ee..b33fc0b1444a 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -4287,14 +4287,18 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>  void xdp_do_flush(void)
>>  {
>>      struct list_head *lh_map, *lh_dev, *lh_xsk;
>> +    int redirect;
>>  
>> -    bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>> -    if (lh_dev)
>> -            __dev_flush(lh_dev);
>> -    if (lh_map)
>> -            __cpu_map_flush(lh_map);
>> -    if (lh_xsk)
>> -            __xsk_map_flush(lh_xsk);
>> +    do {
>> +            redirect = 0;
>> +            bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>> +            if (lh_dev)
>> +                    __dev_flush(lh_dev, &redirect);
>> +            if (lh_map)
>> +                    __cpu_map_flush(lh_map);
>> +            if (lh_xsk)
>> +                    __xsk_map_flush(lh_xsk);
>> +    } while (redirect > 0);
>>  }
>>  EXPORT_SYMBOL_GPL(xdp_do_flush);
>>  
>> @@ -4303,20 +4307,24 @@ void xdp_do_check_flushed(struct napi_struct *napi)
>>  {
>>      struct list_head *lh_map, *lh_dev, *lh_xsk;
>>      bool missed = false;
>> +    int redirect;
>>  
>> -    bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>> -    if (lh_dev) {
>> -            __dev_flush(lh_dev);
>> -            missed = true;
>> -    }
>> -    if (lh_map) {
>> -            __cpu_map_flush(lh_map);
>> -            missed = true;
>> -    }
>> -    if (lh_xsk) {
>> -            __xsk_map_flush(lh_xsk);
>> -            missed = true;
>> -    }
>> +    do {
>> +            redirect = 0;
>> +            bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>> +            if (lh_dev) {
>> +                    __dev_flush(lh_dev, &redirect);
>> +                    missed = true;
>> +            }
>> +            if (lh_map) {
>> +                    __cpu_map_flush(lh_map);
>> +                    missed = true;
>> +            }
>> +            if (lh_xsk) {
>> +                    __xsk_map_flush(lh_xsk);
>> +                    missed = true;
>> +            }
>> +    } while (redirect > 0);
> 
> With the change suggested above (so that bq_xmit_all() guarantees the
> flush is completely done), this looping is not needed anymore. However,
> it becomes important in which *order* the flushing is done
> (__dev_flush() should always happen first), so adding a comment to note
> this would be good.


I guess if we remove the loop here and still want to cover the case of
redirecting from a devmap program via a cpumap, we need to fetch lh_map
again after calling __dev_flush, right?
So I think we should no longer use bpf_net_ctx_get_all_used_flush_lists then:

        lh_dev = bpf_net_ctx_get_dev_flush_list();
        if (lh_dev)
                __dev_flush(lh_dev);
        lh_map = bpf_net_ctx_get_cpu_map_flush_list();
        if (lh_map)
                __cpu_map_flush(lh_map);
        lh_xsk = bpf_net_ctx_get_xskmap_flush_list();
        if (lh_xsk)
                __xsk_map_flush(lh_xsk);

But bpf_net_ctx_get_all_used_flush_lists also includes additional checks
for IS_ENABLED(CONFIG_BPF_SYSCALL) and IS_ENABLED(CONFIG_XDP_SOCKETS),
so I guess those checks should then live directly in xdp_do_flush and
xdp_do_check_flushed?
Then we could remove bpf_net_ctx_get_all_used_flush_lists entirely.
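
Something like the following for xdp_do_flush(), maybe (untested sketch,
reusing the per-list getters from above; I am not sure yet where exactly
the IS_ENABLED() checks belong):

void xdp_do_flush(void)
{
        /* __dev_flush() must run first: a devmap program may redirect
         * again and add new entries to the cpumap/xskmap flush lists
         */
        __dev_flush(bpf_net_ctx_get_dev_flush_list());

        if (IS_ENABLED(CONFIG_BPF_SYSCALL))
                __cpu_map_flush(bpf_net_ctx_get_cpu_map_flush_list());

        if (IS_ENABLED(CONFIG_XDP_SOCKETS))
                __xsk_map_flush(bpf_net_ctx_get_xskmap_flush_list());
}
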
Or do you have an idea for a more elegant solution?

Thanks,
Florian
