Re: [PATCH v2] eal/x86: optimize memcpy of small sizes

Bruce Richardson Fri, 21 Nov 2025 09:02:41 -0800

On Fri, Nov 21, 2025 at 08:57:30AM -0800, Stephen Hemminger wrote:
> On Fri, 21 Nov 2025 10:35:35 +0000
> Morten Brørup <[email protected]> wrote:
> 
> > The implementation for copying up to 64 bytes does not depend on address
> > alignment with the size of the CPU's vector registers, so the code
> > handling this was moved from the various implementations to the common
> > function.
> > 
> > Furthermore, the function for copying less than 16 bytes was replaced with
> > a smarter implementation using fewer branches and potentially fewer
> > load/store operations.
> > This function was also extended to handle copying of up to 16 bytes,
> > instead of up to 15 bytes. This small extension reduces the code path for
> > copying two pointers.
> > 
> > These changes provide two benefits:
> > 1. The memory footprint of the copy function is reduced.
> > Previously there were two instances of the compiled code to copy up to 64
> > bytes, one in the "aligned" code path, and one in the "generic" code path.
> > Now there is only one instance, in the "common" code path.
> > 2. The performance for copying up to 64 bytes is improved.
> > The memcpy performance test shows cache-to-cache copying of up to 32 bytes
> > now typically only takes 2 cycles (4 cycles for 64 bytes) versus
> > ca. 6.5 cycles before this patch.
> > 
> > And finally, the missing implementation of rte_mov48() was added.
> > 
> > Signed-off-by: Morten Brørup <[email protected]>
> 
> As I have said before, I would rather that DPDK move away from having its
> own specialized memcpy.  How does this compare to stock inline gcc?
> The main motivation is that the glibc/gcc team does more testing across
> multiple architectures and has a community with more expertise on CPU
> special cases.


I would tend to agree. Even if we get rte_memcpy a few cycles faster, I
suspect many apps wouldn't notice the difference. However, I understand
that the virtio/vhost libraries gain from using rte_memcpy over standard
memcpy - or at least used to. Perhaps we can consider deprecating
rte_memcpy and just putting a vhost-specific memcpy in that library?
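
For reference, the small-copy technique the patch description mentions
(fewer branches, and overlapping loads/stores for sizes up to 16 bytes)
can be sketched roughly as below. This is only an illustrative sketch and
not the code from the patch; the name copy_upto16() is hypothetical, and
it relies on the assumption that the compiler turns each fixed-size
memcpy() into a single load or store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy n bytes, 0 <= n <= 16.
 * Each size class copies a "head" and a "tail" chunk that may overlap,
 * so one branch covers a whole range of sizes, e.g. 8..16 bytes.
 */
static inline void
copy_upto16(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    if (n >= 8) {
        uint64_t head, tail;

        /* Load both chunks before storing, then store both. */
        memcpy(&head, s, 8);
        memcpy(&tail, s + n - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) {
        uint32_t head, tail;

        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else if (n >= 2) {
        uint16_t head, tail;

        memcpy(&head, s, 2);
        memcpy(&tail, s + n - 2, 2);
        memcpy(d, &head, 2);
        memcpy(d + n - 2, &tail, 2);
    } else if (n == 1) {
        d[0] = s[0];
    }
}
```

With this shape, copying two pointers (16 bytes on a 64-bit target) takes
the first branch and needs two loads and two stores, with no loop.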

/Bruce
