On 7/11/2025 8:33 AM, Maciej Fijalkowski wrote: > On Wed, Jul 09, 2025 at 12:07:30PM -0700, Jacob Keller wrote: >> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each >> buffer in the current frame. This function was introduced as part of >> handling multi-buffer XDP support in the ice driver. >> >> It works by iterating over the buffers from first_desc up to 1 plus the >> total number of fragments in the frame, cached from before the XDP program >> was executed. >> >> If the hardware posts a descriptor with a size of 0, the logic used in >> ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added >> as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a >> fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't >> call ice_put_rx_buf(). >> >> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the >> page or free it. This leaves a stale page in the ring, as we don't >> increment next_to_alloc. >> >> The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented >> properly, and that it always points to a buffer with a NULL page. Since >> this function doesn't check, it will happily recycle a page over the top >> of the next_to_alloc buffer, losing track of the old page. >> >> Note that this leak only occurs for multi-buffer frames. The >> ice_put_rx_mbuf() function always handles at least one buffer, so a >> single-buffer frame will always get handled correctly. It is not clear >> precisely why the hardware hands us descriptors with a size of 0 sometimes, >> but it happens somewhat regularly with "jumbo frames" used by 9K MTU. >> >> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on >> all buffers between first_desc and next_to_clean. Borrow the logic of a >> similar function in i40e used for this same purpose. Use the same logic >> also in ice_get_pgcnts(). >> >> Instead of iterating over just the number of fragments, use a loop which >> iterates until the current index reaches to the next_to_clean element just >> past the current frame. Check the current number of fragments (post XDP >> program). For all buffers up 1 more than the number of fragments, we'll >> update the pagecnt_bias. For any buffers past this, pagecnt_bias is left >> as-is. This ensures that fragments released by the XDP program, as well as >> any buffers with zero-size won't have their pagecnt_bias updated >> incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call >> ice_put_rx_buf() on the last buffer of the frame indicating end of packet. >> >> Move the increment of the ntc local variable to ensure its updated *before* >> all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic >> requires the index of the element just after the current frame. >> >> This has the advantage that we also no longer need to track or cache the >> number of fragments in the rx_ring, which saves a few bytes in the ring. >> >> Cc: Christoph Petrausch <christoph.petrau...@deepl.com> >> Reported-by: Jaroslav Pulchart <jaroslav.pulch...@gooddata.com> >> Closes: >> https://lore.kernel.org/netdev/cak8ffz4hy6gujnenz3wy9jaylzxgfpr7dnzxzgmyoe44car...@mail.gmail.com/ >> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current >> frame") >> Signed-off-by: Jacob Keller <jacob.e.kel...@intel.com> >> --- >> I've tested this in a setup with MTU 9000, using a combination of iperf3 >> and wrk generated traffic. >> >> I tested this in a couple of ways. First, I check memory allocations using >> /proc/allocinfo: >> >> awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' >> /proc/allocinfo | numfmt --to=iec >> >> Second, I ported some stats from i40e written by Joe Damato to track the >> page allocation and busy counts. I consistently saw that the allocate stat >> increased without the busy or waive stats increasing. I also added a stat >> to track directly when we overwrote a page pointer that was non-NULL in >> ice_reuse_rx_page(), and saw it increment consistently. >> >> With this fix, all of these indicators are fixed. I've tested both 1500 >> byte and 9000 byte MTU and no longer see the leak. With the counters I was >> able to immediately see a leak within a few minutes of iperf3, so I am >> confident that I've resolved the leak with this fix. >> --- >> drivers/net/ethernet/intel/ice/ice_txrx.h | 1 - >> drivers/net/ethernet/intel/ice/ice_txrx.c | 71 >> ++++++++++++------------------- >> 2 files changed, 28 insertions(+), 44 deletions(-) >> >> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h >> b/drivers/net/ethernet/intel/ice/ice_txrx.h >> index a4b1e9514632..07155e615f75 100644 >> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h >> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h >> @@ -358,7 +358,6 @@ struct ice_rx_ring { >> struct ice_tx_ring *xdp_ring; >> struct ice_rx_ring *next; /* pointer to next ring in q_vector */ >> struct xsk_buff_pool *xsk_pool; >> - u32 nr_frags; >> u16 max_frame; >> u16 rx_buf_len; >> dma_addr_t dma; /* physical address of ring */ >> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c >> b/drivers/net/ethernet/intel/ice/ice_txrx.c >> index 0e5107fe62ad..b139066b6f0d 100644 >> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c >> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c >> @@ -865,10 +865,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct >> xdp_buff *xdp, >> __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page, >> rx_buf->page_offset, size); >> sinfo->xdp_frags_size += size; >> - /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail() >> - * can pop off frags but driver has to handle it on its own >> - */ >> - rx_ring->nr_frags = sinfo->nr_frags; >> >> if (page_is_pfmemalloc(rx_buf->page)) >> xdp_buff_set_frag_pfmemalloc(xdp); >> @@ -939,20 +935,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const >> unsigned int size, >> /** >> * ice_get_pgcnts - grab page_count() for gathered fragments >> * @rx_ring: Rx descriptor ring to store the page counts on >> + * @ntc: the next to clean element (not included in this frame!) >> * >> * This function is intended to be called right before running XDP >> * program so that the page recycling mechanism will be able to take >> * a correct decision regarding underlying pages; this is done in such >> * way as XDP program can change the refcount of page >> */ >> -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring) >> +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc) >> { >> - u32 nr_frags = rx_ring->nr_frags + 1; >> u32 idx = rx_ring->first_desc; >> struct ice_rx_buf *rx_buf; >> u32 cnt = rx_ring->count; >> >> - for (int i = 0; i < nr_frags; i++) { >> + while (idx != ntc) { >> rx_buf = &rx_ring->rx_buf[idx]; >> rx_buf->pgcnt = page_count(rx_buf->page); >> >> @@ -1125,62 +1121,49 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct >> ice_rx_buf *rx_buf) >> } >> >> /** >> - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags >> + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame >> * @rx_ring: Rx ring with all the auxiliary data >> * @xdp: XDP buffer carrying linear + frags part >> * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage >> - * @ntc: a current next_to_clean value to be stored at rx_ring >> + * @ntc: the next to clean element (not included in this frame!) >> * @verdict: return code from XDP program execution >> * >> - * Walk through gathered fragments and satisfy internal page >> - * recycle mechanism; we take here an action related to verdict >> - * returned by XDP program; >> + * Called after XDP program is completed, or on error with verdict set to >> + * ICE_XDP_CONSUMED. >> + * >> + * Walk through buffers from first_desc to the end of the frame, releasing >> + * buffers and satisfying internal page recycle mechanism. The action >> depends >> + * on verdict from XDP program. >> */ >> static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff >> *xdp, >> u32 *xdp_xmit, u32 ntc, u32 verdict) >> { >> - u32 nr_frags = rx_ring->nr_frags + 1; >> + u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags; >> u32 idx = rx_ring->first_desc; >> u32 cnt = rx_ring->count; >> - u32 post_xdp_frags = 1; >> struct ice_rx_buf *buf; >> - int i; >> + int i = 0; >> >> - if (unlikely(xdp_buff_has_frags(xdp))) >> - post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags; >> - >> - for (i = 0; i < post_xdp_frags; i++) { >> + while (idx != ntc) { >> buf = &rx_ring->rx_buf[idx]; >> + if (++idx == cnt) >> + idx = 0; >> >> - if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) { >> + /* An XDP program could release fragments from the end of the >> + * buffer. For these, we need to keep the pagecnt_bias as-is. >> + * To do this, only adjust pagecnt_bias for fragments up to >> + * the total remaining after the XDP program has run. >> + */ >> + if (verdict != ICE_XDP_CONSUMED) >> ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz); >> - *xdp_xmit |= verdict; > > Hi Jake, > > you're likely breaking XDP_REDIRECT/XDP_TX workloads. I believe you need > to give this patch a spin against xdp-bench and test all actions... > > anyways thanks for great analysis and bugfix! >
Maciej is right, we need to update xdp_xmit somewhere. I think doing so in ice_rx_put_mbuf is wrong (we're bitwise OR multiple times the same verdict once per buffer, and we only need to do this if an XDP program is run, so 2 of the 3 callers don't need to do this. One of them even passes NULL to xdp_xmit, which isn't ever checked, but happens to work because it doesn't pass XDP_TX or XDP_REDIR as the verdict). I'll fix this by dropping the xdp_xmit parameter to this function and updating the xdp_xmit outside in the one place where we need to do so. I'm building that version to test against xdp-bench to verify. Thanks for spotting this oversight! Hopefully I'll have v2 out soon but it might not make it before the weekend.
OpenPGP_signature.asc
Description: OpenPGP digital signature