On 5/14/2026 3:01 AM, David Laight wrote: > On Wed, 13 May 2026 21:47:11 -0700 > John Ousterhout <[email protected]> wrote: > >> On Wed, May 13, 2026 at 1:49 PM David Laight >> <[email protected]> wrote: >>> >>> On Wed, 13 May 2026 09:28:40 -0700 >>> John Ousterhout <[email protected]> wrote: >>> >>>> On Wed, May 13, 2026 at 2:07 AM David Laight >>>> <[email protected]> wrote: >>>>> >>>>> On Tue, 12 May 2026 11:19:53 -0700 >>>>> John Ousterhout <[email protected]> wrote: >>>>> >>>>>> Consider the following sequence of events: >>>>>> * The bottom half of a buffer page is filled with data from >>>>>> packet A. The page has a net reference count (reference count >>>>>> - bias) of 1. The page is returned to the NIC, flipped to >>>>>> use the top half. >>>>>> * Before the reference on the page is released, the NIC returns >>>>>> the page with no data in it ('size' is zero in ice_clean_rx_irq). >>>>>> In this case the bias does not get decremented. The page still >>>>>> has a net reference count of 1, so it gets returned to the NIC. >>>>>> However, ice_put_rx_mbuf flipped the page so that the bottom >>>>>> half is active. >>>>>> * If the NIC stores another packet in the page before packet A >>>>>> has released its reference, the data in packet A will be >>>>>> overwritten with data from the new packet. >>>>>> * Unfortunately zero-length buffers occur frequently: they seem >>>>>> to occur whenever a packet uses every available byte in a >>>>>> buffer, ending precisely at the end of the buffer. When this >>>>>> happens the NIC seems to generate an extra zero-length >>>>>> buffer. >>>>>> The fix is for ice_put_rx_mbuf not to flip pages that have a >>>>>> size of 0. >>>>> >>>>> How is this different from packet B (in the top half) being >>>>> freed before packet A (in the bottom half)? >>>> >>>> I'm not sure exactly what you're referring to here. Are you asking >>>> about a situation where both halves of the page get filled with packet >>>> data and then the second half to be filled is the first to be freed? I >>>> believe that the ICE driver abandons a page if both halves are ever >>>> occupied simultaneously; the page will be returned to the system once >>>> both halves have dropped their references. Thus it doesn't matter >>>> which half is freed first. >>> >>> That is what I was thinking, seems like the logic is over complicated. >>> >>> If you need to put 4k pages into some kind of iommu rather than 2k buffers >>> (to contain 1536 byte ethernet packets) then I'd have thought you'd >>> initially put both halves into adjacent tx ring entries. >>> If a rx buffer is discarded (eg a zero length fragment or a CRC error, >>> or even 'copy break' for short packets) then, as an optimisation, >>> you could reuse the buffer for another receive. >>> The same could be done if the page is freed by an application. >>> >>> However it sounds like it doesn't use the 2nd half until the first >>> completes - otherwise you'd never 'flip' to make the other half >>> active. >>> >>> Thinks... >>> By only putting half of each 4k 'page' into the rx ring the code >>> will usually save (expensive) iommu setup in the (probably) normal >>> case where the buffers are freed 'reasonably quickly'. >>> But that really requires a 'free/with_nic/busy' state for each half >>> rather then trying to guess from a reference count. >>> >>> But if the low-level code is recycling the rx buffer (for any reason) >>> it wants to use the same buffer. >>> >>> The ethernet driver I wrote (a long time ago, early 90s) allocated >>> 64k as 128 512byte buffers and did an aligned word-sized copy of >>> every receive frame - most frames were in contiguous memory. >>> The simplicity of it made up for the cost of the copy, especially >>> since that was an iommu system. >> >> I'm not here to defend the logic (and it has been replaced with >> something that is probably simpler and more efficient); I'm just >> suggesting a bug fix for the stable releases that still have this >> logic.
Right. We definitely want a fix for the possible data corruption in stable. Ideally one as simple as possible. > > You've forced me to look at all of the function :-) > I've noticed a few things: > - If ice_add_xdp_frag() fails (because there are too many fragments) > then the rest of the fragments are left in the tx ring (instead > of being discarded) - so are likely to be treated as a full packet > later on. > - Frames with status errors (crc, framing etc) are discarded after > the skb is built - surely that should happen before the xdp 'program' > is called. > - If the remote system send a very very long frame (traditionally the PHY's > 'jabber detect' didn't always work) you can end up with all of the rx > ring being full of a single partial packet. > > I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'. > Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for > zero length fragments. > The comment would be 'zero length fragments can always be reused'. > That seems correct. > The zero length fragments almost certainly exist because the mac hardware > advances the the new buffer expecting more data - but only gets the > 4 byte CRC. So the zero length buffer contains the receive status. > That matches my understanding.
