> > - Add rte_prefetch0() to prefetch next frame/mbuf while processing
> >   current packet, reducing cache miss latency
>
> Makes sense. If you really want to dive deeper, there are more
> unrolled loop patterns possible; there is a multi-step unrolled
> loop pattern that fd.io uses. The reason is that the first prefetch
> is usually useless and doesn't help, but skipping ahead farther
> does.

I didn't want to go too overboard, and there are trade-offs (fetching
too much may evict cache lines you still need). The upcoming GRO
support (in a follow-up series) enables ~64k+ payloads, which
increases the memory footprint per packet. Would you prefer I remove
the prefetch+1, or is it OK to keep?

> > - Replace memcpy() with rte_memcpy() for optimized copy operations
> There is no good reason that rte_memcpy() should be faster than memcpy().
> There were some cases observed with virtio but my hunch is that this is
> because the two routines are making different alignment assumptions.

Ack, I will drop rte_memcpy(). Under what scenarios is rte_memcpy()
preferred or beneficial?
