> > - Add rte_prefetch0() to prefetch next frame/mbuf while processing
> >   current packet, reducing cache miss latency
>
> Makes sense. If you really want to dive deeper, there are more
> unrolled-loop patterns possible; there is a multi-step unrolled loop
> pattern that fd.io uses. The reason is that the first prefetch is
> usually useless and doesn't help, but skipping ahead farther helps.
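If I follow, the pattern you mean looks roughly like the sketch below:
prefetch the packet data several entries ahead of the one being
processed, rather than just the next one. (PREFETCH_OFFSET and
handle_pkt() are placeholders of mine, not anything in this series.)

    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    #define PREFETCH_OFFSET 3  /* how far ahead to prefetch; needs tuning */

    static void
    process_burst(struct rte_mbuf **pkts, uint16_t nb_rx)
    {
            uint16_t i;

            for (i = 0; i < nb_rx; i++) {
                    /* Prefetch packet data a few entries ahead; the
                     * immediately-next (+1) prefetch usually lands too
                     * late to hide the miss. */
                    if (i + PREFETCH_OFFSET < nb_rx)
                            rte_prefetch0(rte_pktmbuf_mtod(
                                    pkts[i + PREFETCH_OFFSET], void *));

                    handle_pkt(pkts[i]);  /* placeholder per-packet work */
            }
    }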
I didn't want to go too overboard, and there are trade-offs (prefetching
too much may evict entries you actually need). The upcoming GRO support
(in a follow-up series) enables ~64k+ payloads, which increases the
memory footprint per packet. Would you prefer I remove the prefetch of
the next (+1) packet, or is it OK to keep?

> > - Replace memcpy() with rte_memcpy() for optimized copy operations
>
> There is no good reason that rte_memcpy() should be faster than memcpy().
> There were some cases observed with virtio but my hunch is that this is
> because the two routines are making different alignment assumptions.

Ack, I will drop rte_memcpy(). Under what scenarios is rte_memcpy()
preferred/beneficial?

