> From: Stephen Hemminger [mailto:[email protected]]
> Sent: Thursday, 29 January 2026 02.06
>
> On Wed, 28 Jan 2026 09:30:20 -0800
> Stephen Hemminger <[email protected]> wrote:
>
> > Implement the single/dual/quad loop design pattern from FD.IO VPP to
> > improve cache efficiency in the af_packet PMD receive path.
> >
> > The original implementation processes packets one at a time in a simple
> > loop, which can result in cache misses when accessing frame headers and
> > packet data. The new implementation:
> >
> > - Processes packets in batches of 4 (quad), 2 (dual), and 1 (single)
> > - Prefetches next batch of frame headers while processing current batch
> > - Prefetches packet data before memcpy to hide memory latency
> > - Reduces loop overhead through partial unrolling
> >
> > Two helper functions are introduced:
> > - af_packet_get_frame(): Returns frame pointer at index with wraparound
> > - af_packet_rx_one(): Common per-packet processing (mbuf alloc, memcpy,
> >   VLAN handling, timestamp offload)
> >
> > The quad loop checks availability of all 4 frames before processing,
> > falling through to dual/single loops when fewer frames are ready. Early
> > exit paths (out_advance1/2/3) ensure correct frame index tracking when
> > mbuf allocation fails mid-batch.
> >
> > Prefetch strategy:
> > - Frame headers: prefetch N+4..N+7 while processing N..N+3
> > - Packet data: prefetch at tp_mac offset before memcpy
> >
> > This pattern is well-established in high-performance packet processing
> > and should improve throughput by better utilizing CPU cache hierarchy,
> > particularly beneficial when processing bursts of packets.
> >
> > Signed-off-by: Stephen Hemminger <[email protected]>
>
> This and previous proposal to prefetch have no impact on performance.
> Rolled a simple perf test and all three versions come out the same.
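
For readers not familiar with it, the quad loop with the prefetch strategy described above has roughly the following shape. This is only an illustrative sketch of the pattern, not the actual patch; the helper signatures, the TP_STATUS_USER availability check and the rxq field names (framenum, framecount) are my assumptions.

/*
 * Illustrative sketch of the quad loop only -- not the actual patch.
 * Helper signatures, the TP_STATUS_USER check and the rxq field names
 * are assumptions.
 */
static uint16_t
eth_af_packet_rx_quad_sketch(struct pkt_rx_queue *rxq,
			     struct rte_mbuf **bufs, uint16_t nb_pkts)
{
	unsigned int framenum = rxq->framenum;
	uint16_t nb_rx = 0;
	int i;

	while (nb_rx + 4 <= nb_pkts) {
		struct tpacket2_hdr *tph[4];

		/* Quad loop: require all 4 frames to be ready,
		 * otherwise fall back to the dual/single loops. */
		for (i = 0; i < 4; i++) {
			tph[i] = af_packet_get_frame(rxq, framenum + i);
			if (!(tph[i]->tp_status & TP_STATUS_USER))
				goto out;
		}

		/* Prefetch the next batch of frame headers (N+4..N+7). */
		for (i = 4; i < 8; i++)
			rte_prefetch0(af_packet_get_frame(rxq, framenum + i));

		for (i = 0; i < 4; i++) {
			/* Prefetch packet data at tp_mac ahead of the
			 * memcpy inside af_packet_rx_one(). */
			rte_prefetch0((char *)tph[i] + tph[i]->tp_mac);

			if (af_packet_rx_one(rxq, tph[i], &bufs[nb_rx]) < 0)
				goto out; /* mbuf alloc failed mid-batch */

			nb_rx++;
			framenum = (framenum + 1) % rxq->framecount;
		}
	}
out:
	/* Dual and single loops (not shown) would drain the remaining
	 * 0..3 frames here before storing the ring position back. */
	rxq->framenum = framenum;
	return nb_rx;
}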
Please be aware that many test cases are inadvertently designed in a way where the mbufs are already hot in the cache, so prefetching does not provide the expected performance gain. E.g., when working on a newly allocated mbuf, the mbuf should be cold. But if it came from the mempool cache, and was recently worked on and then freed back into the mempool cache, it will be hot. (A rough sketch illustrating this effect is appended at the bottom of this mail.)

> The bottleneck is not here, probably at system call and copies now.

The most important bottleneck might be elsewhere, but this optimization might not be as irrelevant as the test results indicate. Anyway, I agree that dropping the patch (for now) makes sense.

>          Original      Prefetch      Quad/Dual
> TX       1.427 Mpps    1.426 Mpps    1.426 Mpps
> RX       0.529 Mpps    0.530 Mpps    0.533 Mpps
> loss     87.93%        87.98%        88.0%
>
> Will put the test in the next version of this series, and
> drop this patch.
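
To make the benchmark point above concrete, here is a rough sketch; the pool name, sizes and loop count are made up, and an initialized EAL is assumed. With a non-zero per-lcore mempool cache, a tight alloc/work/free loop keeps recycling the same few mbufs, so they stay hot in the CPU cache and prefetching cannot show any benefit.

#include <string.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/*
 * Sketch only; pool name, sizes and loop count are made up, and EAL is
 * assumed to be initialized. Creating the pool with cache_size = 0, or
 * cycling through far more mbufs than fit in the CPU caches, gives a
 * colder and more realistic test.
 */
static void
hot_mbuf_test_loop(void)
{
	struct rte_mempool *mp = rte_pktmbuf_pool_create("sketch_pool",
			8192,	/* number of mbufs */
			256,	/* per-lcore cache: this keeps mbufs hot */
			0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
	if (mp == NULL)
		return;

	for (int i = 0; i < 1000000; i++) {
		struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
		if (m == NULL)
			break;

		/* "Work" on the packet data; the mbuf most likely just came
		 * out of the per-lcore cache and is still in L1/L2. */
		memset(rte_pktmbuf_mtod(m, void *), 0, 64);

		rte_pktmbuf_free(m); /* straight back into the mempool cache */
	}
}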

