TX rte_memcpy, bulk free, prefetch

Stephen Hemminger Tue, 27 Jan 2026 10:54:58 -0800

On Tue, 27 Jan 2026 10:13:54 -0800
[email protected] wrote:

> From: Scott Mitchell <[email protected]>
> 
> - Add rte_prefetch0() to prefetch next frame/mbuf while processing
>   current packet, reducing cache miss latency


Makes sense. If you really want to dive deeper, there are more
unrolled-loop patterns possible; fd.io uses a multi-step unrolled
loop pattern. The reason is that a prefetch for the very next packet
is usually useless, since it has little time to complete before the
data is needed, but prefetching farther ahead helps.
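
To illustrate prefetching farther ahead, here is a minimal sketch in
plain C. It is not the fd.io unrolled-loop pattern and not the
af_packet driver code. struct pkt, PREFETCH_AHEAD and sum_lengths()
are hypothetical names, and the GCC/Clang builtin __builtin_prefetch()
stands in for rte_prefetch0(). The lookahead of 4 is an assumption;
a real value would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; not the real struct rte_mbuf. */
struct pkt {
	uint32_t len;
	uint8_t data[60];
};

/* Hypothetical lookahead distance; the right value is workload-specific. */
#define PREFETCH_AHEAD 4

/* Sum the lengths of n packets, prefetching PREFETCH_AHEAD entries ahead. */
static uint64_t
sum_lengths(struct pkt *const *pkts, size_t n)
{
	uint64_t total = 0;
	size_t i;

	assert(pkts != NULL || n == 0);

	/* Warm up: start loads for the first few packets before the loop. */
	for (i = 0; i < PREFETCH_AHEAD && i < n; i++)
		__builtin_prefetch(pkts[i], 0, 3);

	for (i = 0; i < n; i++) {
		/* Request the packet several iterations ahead, not just i + 1. */
		if (i + PREFETCH_AHEAD < n)
			__builtin_prefetch(pkts[i + PREFETCH_AHEAD], 0, 3);
		total += pkts[i]->len;
	}
	return total;
}
```

The bounds check keeps the loop from reading past the end of the
pointer array; the prefetch itself is only a hint and does not change
the result.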

> - Replace memcpy() with rte_memcpy() for optimized copy operations

There is no good reason that rte_memcpy() should be faster than memcpy().
Some cases were observed with virtio, but my hunch is that this is
because the two routines make different alignment assumptions.

> - Use rte_pktmbuf_free_bulk() in TX path instead of individual
>   rte_pktmbuf_free() calls for better batch efficiency

Makes sense.

> - Add unlikely() hints for error paths (oversized packets, VLAN
>   insertion failures, sendto errors) to optimize branch prediction

Also makes sense.
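
For readers unfamiliar with the hint, a minimal sketch in plain C of
marking an error path as unlikely follows. my_unlikely(), MAX_FRAME
and count_sendable() are hypothetical names for illustration; DPDK
ships its own unlikely() in rte_branch_prediction.h. The sketch
assumes GCC or Clang, where __builtin_expect() mainly influences how
the compiler lays out the hot and cold paths.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for DPDK's unlikely(); assumes a GCC/Clang compiler. */
#define my_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical frame size limit, chosen only for this example. */
#define MAX_FRAME 1518u

/* Count the frames that would be sent, skipping oversized ones. */
static unsigned int
count_sendable(const unsigned int *lens, unsigned int n)
{
	unsigned int i, sent = 0;

	assert(lens != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Rare error path: tell the compiler to favor the fall-through. */
		if (my_unlikely(lens[i] > MAX_FRAME))
			continue;
		sent++;
	}
	return sent;
}
```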

> - Remove unnecessary early nb_pkts == 0 when loop handles this
>   and app may never call with 0 frames.

Yes, calling tx/rx burst with nb_pkts == 0 only needs to work;
it does not need a short circuit.

> Signed-off-by: Scott Mitchell <[email protected]>

Re: [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch
