On Tue, May 26, 2026 at 5:47 AM Petr Oros <[email protected]> wrote: > > > On 5/14/26 18:43, Jacob Keller wrote: > > On 5/14/2026 3:01 AM, David Laight wrote: > >> On Wed, 13 May 2026 21:47:11 -0700 > >> John Ousterhout <[email protected]> wrote: > >> > >>> On Wed, May 13, 2026 at 1:49 PM David Laight > >>> <[email protected]> wrote: > >>>> On Wed, 13 May 2026 09:28:40 -0700 > >>>> John Ousterhout <[email protected]> wrote: > >>>> > >>>>> On Wed, May 13, 2026 at 2:07 AM David Laight > >>>>> <[email protected]> wrote: > >>>>>> On Tue, 12 May 2026 11:19:53 -0700 > >>>>>> John Ousterhout <[email protected]> wrote: > >>>>>> > >>>>>>> Consider the following sequence of events: > >>>>>>> * The bottom half of a buffer page is filled with data from > >>>>>>> packet A. The page has a net reference count (reference count > >>>>>>> - bias) of 1. The page is returned to the NIC, flipped to > >>>>>>> use the top half. > >>>>>>> * Before the reference on the page is released, the NIC returns > >>>>>>> the page with no data in it ('size' is zero in ice_clean_rx_irq). > >>>>>>> In this case the bias does not get decremented. The page still > >>>>>>> has a net reference count of 1, so it gets returned to the NIC. > >>>>>>> However, ice_put_rx_mbuf flipped the page so that the bottom > >>>>>>> half is active. > >>>>>>> * If the NIC stores another packet in the page before packet A > >>>>>>> has released its reference, the data in packet A will be > >>>>>>> overwritten with data from the new packet. > >>>>>>> * Unfortunately zero-length buffers occur frequently: they seem > >>>>>>> to occur whenever a packet uses every available byte in a > >>>>>>> buffer, ending precisely at the end of the buffer. When this > >>>>>>> happens the NIC seems to generate an extra zero-length > >>>>>>> buffer. > >>>>>>> The fix is for ice_put_rx_mbuf not to flip pages that have a > >>>>>>> size of 0. > >>>>>> How is this different from packet B (in the top half) being > >>>>>> freed before packet A (in the bottom half)? > >>>>> I'm not sure exactly what you're referring to here. Are you asking > >>>>> about a situation where both halves of the page get filled with packet > >>>>> data and then the second half to be filled is the first to be freed? I > >>>>> believe that the ICE driver abandons a page if both halves are ever > >>>>> occupied simultaneously; the page will be returned to the system once > >>>>> both halves have dropped their references. Thus it doesn't matter > >>>>> which half is freed first. > >>>> That is what I was thinking, seems like the logic is over complicated. > >>>> > >>>> If you need to put 4k pages into some kind of iommu rather than 2k > >>>> buffers > >>>> (to contain 1536 byte ethernet packets) then I'd have thought you'd > >>>> initially put both halves into adjacent tx ring entries. > >>>> If a rx buffer is discarded (eg a zero length fragment or a CRC error, > >>>> or even 'copy break' for short packets) then, as an optimisation, > >>>> you could reuse the buffer for another receive. > >>>> The same could be done if the page is freed by an application. > >>>> > >>>> However it sounds like it doesn't use the 2nd half until the first > >>>> completes - otherwise you'd never 'flip' to make the other half > >>>> active. > >>>> > >>>> Thinks... > >>>> By only putting half of each 4k 'page' into the rx ring the code > >>>> will usually save (expensive) iommu setup in the (probably) normal > >>>> case where the buffers are freed 'reasonably quickly'. > >>>> But that really requires a 'free/with_nic/busy' state for each half > >>>> rather then trying to guess from a reference count. > >>>> > >>>> But if the low-level code is recycling the rx buffer (for any reason) > >>>> it wants to use the same buffer. > >>>> > >>>> The ethernet driver I wrote (a long time ago, early 90s) allocated > >>>> 64k as 128 512byte buffers and did an aligned word-sized copy of > >>>> every receive frame - most frames were in contiguous memory. > >>>> The simplicity of it made up for the cost of the copy, especially > >>>> since that was an iommu system. > >>> I'm not here to defend the logic (and it has been replaced with > >>> something that is probably simpler and more efficient); I'm just > >>> suggesting a bug fix for the stable releases that still have this > >>> logic. > > Right. We definitely want a fix for the possible data corruption in > > stable. Ideally one as simple as possible. > > > >> You've forced me to look at all of the function :-) > >> I've noticed a few things: > >> - If ice_add_xdp_frag() fails (because there are too many fragments) > >> then the rest of the fragments are left in the tx ring (instead > >> of being discarded) - so are likely to be treated as a full packet > >> later on. > >> - Frames with status errors (crc, framing etc) are discarded after > >> the skb is built - surely that should happen before the xdp 'program' > >> is called. > >> - If the remote system send a very very long frame (traditionally the PHY's > >> 'jabber detect' didn't always work) you can end up with all of the rx > >> ring being full of a single partial packet. > >> > >> I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'. > >> Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for > >> zero length fragments. > >> The comment would be 'zero length fragments can always be reused'. > >> > > That seems correct. > > > >> The zero length fragments almost certainly exist because the mac hardware > >> advances the the new buffer expecting more data - but only gets the > >> 4 byte CRC. So the zero length buffer contains the receive status. > >> > > That matches my understanding. > Hi John, > > I have been looking at the same area in the pre-page-pool ice code and > I want to ask whether you observed memory growth during your Homa runs > that exposed the corruption, because in my testing the same bias mismatch > also produces a slow page leak that your v3 does not close. > > Short version of the leak path, in the PASS (!CONSUMED) branch: > > 1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally > (added by commit ef68094cb09e ("ice: Fix kernel panic due to page > refcount underflow") as the fix for the matching panic). > 2. ice_add_xdp_frag() then returns 0 for size==0, so that page is > never attached to the xdp_buff/SKB. Nobody downstream will ever > call put_page() to balance the pagecnt_bias-- from step 1. > 3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for > size==0, which closes the corruption window. But it does not > restore pagecnt_bias for that zero size buffer, so the page is > handed back to ice_reuse_rx_page() with a permanent deficit of 1. > 4. On the next reuse of that page with size > 0, pagecnt_bias drops > again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and > drains via __page_frag_cache_drain(page, pagecnt_bias). Because > pagecnt_bias is one too low, the drain undershoots by 1: page > refcount stays at 2 instead of 1. > 5. The SKB eventually releases its reference (refcount -> 1), but > nothing ever brings it to 0. The page is leaked. > ice_alloc_rx_bufs() just allocates a fresh page to fill the slot. > > At the zero size frequency you mentioned (thousands per second), this > adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart > originally reported against 6.13.y on NUMA nodes and which motivated > the libeth/page_pool conversion in mainline. So in stable trees the > leak side of this bug is still live. > > Two questions: > > - Did you monitor RSS / page allocator stats over the duration of > your Homa runs? If you did and did not see growth, I would like > to understand what is different about your setup, because by my > reading of the code the leak should fire whenever both halves of > a page end up in SKBs simultaneously and one of them carried a > zero size descriptor along the way.
I have not monitored the page allocator stats. I'm not sure I know the best way to do this; I tried slabtop but it didn't seem to show significant growth in memory usage. > - If your focus was specifically the corruption, would you be open > to extending v3 (or replacing it) with a fix that also restores > pagecnt_bias for the size==0 case? The minimal extension is one > extra branch in ice_put_rx_mbuf: > > if (verdict != ICE_XDP_CONSUMED && size != 0) > ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz); > else > buf->pagecnt_bias++; > > which restores bias on every path where the page is not actually > going out to an SKB. (I have a slightly different variant that > tracks has_data in struct ice_rx_buf to also handle the broken > positional 'i <= xdp_frags' counter in the CONSUMED path, where > zero size descriptors in the middle of a frame steal bias++ slots > from real fragments. Happy to share it if useful.) My understanding of the ice driver is extremely limited. You may be right about the proposed fix, but I don't currently know enough to get comfortable with it. I think it might be better to separate your change into a different patch, which can be shepherded by people with appropriate understanding. -John-
