> From: Bruce Richardson [mailto:[email protected]]
> Sent: Thursday, 12 February 2026 10.15
> 
> On Wed, Feb 11, 2026 at 10:51:56PM +0100, Morten Brørup wrote:
> > > Rather than writing a last_id for each individual descriptor, we can
> > > write one only for places where the "report status" (RS) bit is set,
> > > i.e. the descriptors which will be written back when done. The method
> > > used for marking which descriptors are free is also changed in the
> > > process: even if the last descriptor with the "done" bits set is past
> > > the expected point, we only track up to the expected point and leave
> > > the rest to be counted as freed next time. This means that we always
> > > have the RS/DD bits set at fixed intervals, and we always track free
> > > slots in units of the same tx_free_thresh intervals.
> >
> > I'm not saying it's good or bad; I'm simply trying to understand the
> > performance tradeoff...
> > I'm wondering if spreading fields over two separate arrays is
> > beneficial when considering cache misses.
> >
> > This patch introduces a separate array, uint16_t txq->rs_last_id[],
> > which is not in the same cache line as the txe array.
> >
> > So now two separate cache lines must be updated: rs_last_id and txe.
> >
> > Previously, only txe needed updating.
> >
> > Assuming both rings are cold, how many cache misses would a burst of
> > 32 (single-segment) packets cause...
> > Number of cache misses in the txe ring (before this patch, and after)?
> > Number of cache misses in the rs_last_id ring (after this patch)?
> >
> 
> The main txe ring has 4 elements per cacheline both before and after
> this patch. As an aside, 6 bytes of each 16 are wasted, and I have
> tried a couple of times to get us down to an 8-byte ring element; each
> time performance has dropped, presumably due to the extra computation
> and branching for ring wraparound when we remove the precomputed next
> index values.
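>
> For reference, the element layout is roughly like the sketch below.
> This is illustrative only - the field names and exact layout here are
> not necessarily those used in the driver:
>
>     #include <assert.h>
>     #include <stdint.h>
>
>     struct rte_mbuf;   /* opaque here; the real definition is in rte_mbuf.h */
>
>     /*
>      * One sw-ring (txe) element: an mbuf pointer plus a precomputed
>      * "next" index, so the hot path never has to compute
>      * (i + 1) % ring_size. Padded to 16 bytes, i.e. 4 elements per
>      * 64-byte cacheline, with 6 bytes unused (assuming 64-bit pointers).
>      */
>     struct tx_sw_entry {
>         struct rte_mbuf *mbuf;   /* 8 bytes */
>         uint16_t next_id;        /* 2 bytes: precomputed next ring index */
>         uint8_t pad[6];          /* 6 wasted bytes per element */
>     };
>
>     static_assert(sizeof(struct tx_sw_entry) == 16,
>                   "4 entries per 64-byte cacheline");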
> 
> Anyway, for 32 single-segment packets, we obviously touch 8 cachelines
> both before and after this patch in the txe array. In the rs_last_id
> array, we should just read a single index of 16 bits, so the overall
> array has a fairly small cache footprint: 2 bytes per 32 ring entries
> [or one cacheline for a 1024-sized ring].
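>
> Spelling that arithmetic out, continuing the sketch above (64-byte
> cachelines assumed):
>
>     /*
>      * Cache footprint for a 32-packet, single-segment burst: the txe
>      * entries span 8 cachelines, and the per-RS-interval index array
>      * costs 2 bytes per 32 descriptors.
>      */
>     enum {
>         CACHE_LINE    = 64,
>         BURST         = 32,
>         TXE_ELEM_SIZE = 16,   /* sizeof(struct tx_sw_entry) above */
>         RS_INTERVAL   = 32,   /* one RS index per 32 descriptors */
>         RING_SIZE     = 1024,
>     };
>
>     static_assert(BURST * TXE_ELEM_SIZE / CACHE_LINE == 8,
>                   "32 txe entries touch 8 cachelines");
>     static_assert((RING_SIZE / RS_INTERVAL) * sizeof(uint16_t) == CACHE_LINE,
>                   "RS index array for a 1024-entry ring fits in one cacheline");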
> 
> However, the performance improvements given by the rework of the
> cleanup handling are very noticeable, and are not really down to
> memory/cache footprint so much as to doing less work and less
> branching. For example:
> 
> * Before this patch, on transmit we stored the index of the last
>   descriptor for each segment of each packet. That store was to a hot
>   cacheline, but it was still (if no coalescing) potentially 32 stores
>   per burst. After this patch, we instead only store the index of the
>   last segment for the packet crossing the next multiple-of-32
>   boundary, going from potentially 32 stores to 1.
> 
> * When doing cleanup post transmit, we still have an array lookup to
>   determine the location of the NIC write-back that indicates the
>   descriptor is done. However, while this lookup was previously done in
>   the txe array - which was written (ring_size - 32) packets ago - it
>   is now done by reading the new, tiny rs_last_id array, which is much
>   more likely to be in cache.
> 
> * Finally, when doing the actual buffer freeing, we previously had an
>   arbitrary number of buffers - though normally 32 - to free from the
>   ring, starting at an arbitrary location. After this patch, we always
>   have exactly 32 (or, more correctly, rs_thresh, which must divide
>   evenly into ring_size) buffers to free, starting at an index which is
>   a multiple of 32. Most importantly, this guarantees that there is no
>   wraparound as part of the buffer free process, which means no
>   branches for idx >= ring_size etc., and that we are free to treat the
>   buffers to be freed as being stored in a linear array (see the sketch
>   below).
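>
> Putting the cleanup side together, continuing the sketch above. Again,
> this is only illustrative: desc_done() and free_mbuf() are stand-ins
> for the DD-bit check on the hardware descriptor and the mbuf free done
> by the real driver, and the queue fields here are not the actual ones:
>
>     #include <stdbool.h>
>
>     struct sketch_txq {
>         struct tx_sw_entry *sw_ring;  /* ring_size elements, see above */
>         uint16_t *rs_last_id;         /* one entry per rs_thresh descriptors */
>         uint16_t next_dd;             /* start of next block to clean; always
>                                        * a multiple of rs_thresh */
>         uint16_t rs_thresh;           /* divides evenly into ring_size */
>         uint16_t ring_size;
>         uint16_t nb_free;             /* free slots accounted for so far */
>     };
>
>     /* Stand-in: has the NIC written back "done" for descriptor idx? */
>     static bool
>     desc_done(const struct sketch_txq *txq, uint16_t idx)
>     {
>         (void)txq; (void)idx;
>         return true;   /* real code checks the DD bits in the descriptor */
>     }
>
>     /* Stand-in for returning the mbuf to its pool. */
>     static void
>     free_mbuf(struct rte_mbuf *m)
>     {
>         (void)m;
>     }
>
>     static void
>     tx_clean_one_block(struct sketch_txq *txq)
>     {
>         /* One read of the small, hot index array tells us where the
>          * write-back for this block of descriptors lands. */
>         uint16_t dd_idx = txq->rs_last_id[txq->next_dd / txq->rs_thresh];
>
>         if (!desc_done(txq, dd_idx))
>             return;   /* nothing to clean yet */
>
>         /*
>          * Exactly rs_thresh buffers, starting at a multiple of rs_thresh,
>          * so the loop can never wrap: no idx >= ring_size branches, and
>          * the buffers can be treated as a plain linear array.
>          */
>         struct tx_sw_entry *e = &txq->sw_ring[txq->next_dd];
>         for (uint16_t i = 0; i < txq->rs_thresh; i++)
>             free_mbuf(e[i].mbuf);
>
>         txq->nb_free += txq->rs_thresh;
>         /* Wraparound is handled once per block, outside the free loop. */
>         txq->next_dd = (uint16_t)((txq->next_dd + txq->rs_thresh) % txq->ring_size);
>     }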

OK.
Thank you for the detailed explanation.

Acked-by: Morten Brørup <[email protected]>

> 
> >
> > >
> > > Signed-off-by: Bruce Richardson <[email protected]>
> > > Acked-by: Anatoly Burakov <[email protected]>
> > > ---
