On 5/14/26 18:43, Jacob Keller wrote:
On 5/14/2026 3:01 AM, David Laight wrote:
On Wed, 13 May 2026 21:47:11 -0700
John Ousterhout <[email protected]> wrote:

On Wed, May 13, 2026 at 1:49 PM David Laight
<[email protected]> wrote:
On Wed, 13 May 2026 09:28:40 -0700
John Ousterhout <[email protected]> wrote:
On Wed, May 13, 2026 at 2:07 AM David Laight
<[email protected]> wrote:
On Tue, 12 May 2026 11:19:53 -0700
John Ousterhout <[email protected]> wrote:
Consider the following sequence of events:
* The bottom half of a buffer page is filled with data from
   packet A. The page has a net reference count (reference count
   - bias) of 1. The page is returned to the NIC, flipped to
   use the top half.
* Before the reference on the page is released, the NIC returns
   the page with no data in it ('size' is zero in ice_clean_rx_irq).
   In this case the bias does not get decremented. The page still
   has a net reference count of 1, so it gets returned to the NIC.
   However, ice_put_rx_mbuf flipped the page so that the bottom
   half is active.
* If the NIC stores another packet in the page before packet A
   has released its reference, the data in packet A will be
   overwritten with data from the new packet.
* Unfortunately zero-length buffers occur frequently: they seem
   to occur whenever a packet uses every available byte in a
   buffer, ending precisely at the end of the buffer. When this
   happens the NIC seems to generate an extra zero-length
   buffer.
The fix is for ice_put_rx_mbuf not to flip pages that have a
size of 0.
How is this different from packet B (in the top half) being
freed before packet A (in the bottom half)?
I'm not sure exactly what you're referring to here. Are you asking
about a situation where both halves of the page get filled with packet
data and then the second half to be filled is the first to be freed? I
believe that the ICE driver abandons a page if both halves are ever
occupied simultaneously; the page will be returned to the system once
both halves have dropped their references. Thus it doesn't matter
which half is freed first.
That is what I was thinking, seems like the logic is over complicated.

If you need to put 4k pages into some kind of iommu rather than 2k buffers
(to contain 1536 byte ethernet packets) then I'd have thought you'd
initially put both halves into adjacent tx ring entries.
If a rx buffer is discarded (eg a zero length fragment or a CRC error,
or even 'copy break' for short packets) then, as an optimisation,
you could reuse the buffer for another receive.
The same could be done if the page is freed by an application.

However it sounds like it doesn't use the 2nd half until the first
completes - otherwise you'd never 'flip' to make the other half
active.

Thinks...
By only putting half of each 4k 'page' into the rx ring the code
will usually save (expensive) iommu setup in the (probably) normal
case where the buffers are freed 'reasonably quickly'.
But that really requires a 'free/with_nic/busy' state for each half
rather then trying to guess from a reference count.

But if the low-level code is recycling the rx buffer (for any reason)
it wants to use the same buffer.

The ethernet driver I wrote (a long time ago, early 90s) allocated
64k as 128 512byte buffers and did an aligned word-sized copy of
every receive frame - most frames were in contiguous memory.
The simplicity of it made up for the cost of the copy, especially
since that was an iommu system.
I'm not here to defend the logic (and it has been replaced with
something that is probably simpler and more efficient); I'm just
suggesting a bug fix for the stable releases that still have this
logic.
Right. We definitely want a fix for the possible data corruption in
stable. Ideally one as simple as possible.

You've forced me to look at all of the function :-)
I've noticed a few things:
- If ice_add_xdp_frag() fails (because there are too many fragments)
   then the rest of the fragments are left in the tx ring (instead
   of being discarded) - so are likely to be treated as a full packet
   later on.
- Frames with status errors (crc, framing etc) are discarded after
   the skb is built - surely that should happen before the xdp 'program'
   is called.
- If the remote system send a very very long frame (traditionally the PHY's
   'jabber detect' didn't always work) you can end up with all of the rx
   ring being full of a single partial packet.

I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
zero length fragments.
The comment would be 'zero length fragments can always be reused'.

That seems correct.

The zero length fragments almost certainly exist because the mac hardware
advances the the new buffer expecting more data - but only gets the
4 byte CRC. So the zero length buffer contains the receive status.

That matches my understanding.
Hi John,

I have been looking at the same area in the pre-page-pool ice code and
I want to ask whether you observed memory growth during your Homa runs
that exposed the corruption, because in my testing the same bias mismatch
also produces a slow page leak that your v3 does not close.

Short version of the leak path, in the PASS (!CONSUMED) branch:

  1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally
     (added by commit ef68094cb09e ("ice: Fix kernel panic due to page
     refcount underflow") as the fix for the matching panic).
  2. ice_add_xdp_frag() then returns 0 for size==0, so that page is
     never attached to the xdp_buff/SKB. Nobody downstream will ever
     call put_page() to balance the pagecnt_bias-- from step 1.
  3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for
     size==0, which closes the corruption window. But it does not
     restore pagecnt_bias for that zero size buffer, so the page is
     handed back to ice_reuse_rx_page() with a permanent deficit of 1.
  4. On the next reuse of that page with size > 0, pagecnt_bias drops
     again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and
     drains via __page_frag_cache_drain(page, pagecnt_bias). Because
     pagecnt_bias is one too low, the drain undershoots by 1: page
     refcount stays at 2 instead of 1.
  5. The SKB eventually releases its reference (refcount -> 1), but
     nothing ever brings it to 0. The page is leaked.
     ice_alloc_rx_bufs() just allocates a fresh page to fill the slot.

At the zero size frequency you mentioned (thousands per second), this
adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart
originally reported against 6.13.y on NUMA nodes and which motivated
the libeth/page_pool conversion in mainline. So in stable trees the
leak side of this bug is still live.

Two questions:

  - Did you monitor RSS / page allocator stats over the duration of
    your Homa runs? If you did and did not see growth, I would like
    to understand what is different about your setup, because by my
    reading of the code the leak should fire whenever both halves of
    a page end up in SKBs simultaneously and one of them carried a
    zero size descriptor along the way.

  - If your focus was specifically the corruption, would you be open
    to extending v3 (or replacing it) with a fix that also restores
    pagecnt_bias for the size==0 case? The minimal extension is one
    extra branch in ice_put_rx_mbuf:

        if (verdict != ICE_XDP_CONSUMED && size != 0)
                ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
        else
                buf->pagecnt_bias++;

    which restores bias on every path where the page is not actually
    going out to an SKB. (I have a slightly different variant that
    tracks has_data in struct ice_rx_buf to also handle the broken
    positional 'i <= xdp_frags' counter in the CONSUMED path, where
    zero size descriptors in the middle of a frame steal bias++ slots
    from real fragments. Happy to share it if useful.)

Regards,
Petr

Reply via email to