> From: Stephen Hemminger [mailto:[email protected]]
> Sent: Friday, 21 November 2025 18.12
>
> On Fri, 21 Nov 2025 17:02:17 +0000
> Bruce Richardson <[email protected]> wrote:
>
> > > As I have said before would rather that DPDK move away from having its
> > > own specialized memcpy. How is this compared to stock inline gcc?
The "./build/app/test/dpdk-test memcpy_perf_autotest" compares to standard
memcpy(). On my build system, copies of up to 64 bytes (with the size not
known at build time) take 9 cycles using memcpy() vs. 2-4 cycles using
rte_memcpy().

The difference was probably even bigger with older compilers. We should
compare using the oldest compiler versions officially supported by DPDK
(GCC, Clang, MSVC, ...), and across the supported CPUs.

There are plenty of optimizations in DPDK which were relevant at the time
they were added, but have become obsolete over time. I don't think
rte_memcpy() is there yet. (Gut feeling, no data to back it up!) Until we
get there, we should keep optimizing rte_memcpy(). For any per-packet
operation, shaving off a few cycles is valuable. And if the majority of an
application's per-packet copy operations are larger than a few bytes, the
application will not achieve high performance anyway. Thus, I think
optimizing small copies is relevant: a normal DPDK application should
perform many more small copies than large copies. (Measured by number of
copy operations, not number of copied bytes.)

> > > The main motivation is that the glibc/gcc team does more testing
> > > across multiple architectures and has a community with more expertise
> > > on CPU special cases.
> >
> > I would tend to agree. Even if we get rte_memcpy a few cycles faster, I
> > suspect many apps wouldn't notice the difference. However, I understand
> > that the virtio/vhost libraries gain from using rte_memcpy over standard
> > memcpy - or at least used to. Perhaps we can consider deprecating
> > rte_memcpy and just putting a vhost-specific memcpy in that library?
>
> It would be good to figure out why vhost is better with rte_memcpy,
> maybe there is some alignment assumption that is in one and not the
> other?

Looking at a 1024 byte copy on my build system, cache-to-mem is 12 % faster
with rte_memcpy(), and mem-to-cache is 10 % slower.

Maybe the vhost library would benefit from having access to two
rte_memcpy() variants, optimized for cache-to-mem and mem-to-cache
respectively.

There will always be some use cases where a generic "optimized"
rte_memcpy() is suboptimal. Providing specific functions optimized for
specific use cases makes really good sense.
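To illustrate the idea (a rough sketch only; the function names are made up,
nothing like this exists in DPDK today, and real code would need proper
alignment and head/tail handling): a cache-to-mem variant could use
non-temporal stores so the copied data bypasses the cache, while a
mem-to-cache variant should use ordinary stores so the destination stays
cache-hot for the consumer.

/* Rough sketch only: hypothetical direction-specific copy variants. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <immintrin.h>

/* "Cache-to-mem": destination is not read back soon, so use streaming
 * (non-temporal) stores to avoid evicting useful cache lines.
 * Assumes dst is 16-byte aligned; real code must handle misalignment. */
static inline void
copy_cache_to_mem(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
		d += 16; s += 16; len -= 16;
	}
	_mm_sfence(); /* make the streaming stores globally ordered */
	if (len > 0)
		memcpy(d, s, len); /* short tail via the compiler's memcpy */
}

/* "Mem-to-cache": destination is consumed right away, so ordinary
 * (cached) stores are what we want; just defer to memcpy/rte_memcpy. */
static inline void
copy_mem_to_cache(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

Whether this actually helps would of course have to be measured in the real
vhost enqueue/dequeue paths, not in a microbenchmark.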
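As an aside, regarding the small-copy numbers at the top of this mail: the
essential point is that the copy size is a runtime value, so the compiler
cannot expand memcpy() into a fixed-size instruction sequence. Something
roughly along these lines (illustration only, not the actual
memcpy_perf_autotest code; a real benchmark must also keep the compiler from
optimizing the copies away and should discard warm-up iterations):

#include <stdio.h>
#include <string.h>
#include <rte_cycles.h>
#include <rte_memcpy.h>

static uint8_t src[64], dst[64];

static void
time_small_copies(volatile size_t len) /* volatile: size not known at build time */
{
	const unsigned int iters = 1000000;
	uint64_t start;

	start = rte_rdtsc();
	for (unsigned int i = 0; i < iters; i++)
		memcpy(dst, src, len);
	printf("memcpy:     %.1f cycles/copy\n",
			(double)(rte_rdtsc() - start) / iters);

	start = rte_rdtsc();
	for (unsigned int i = 0; i < iters; i++)
		rte_memcpy(dst, src, len);
	printf("rte_memcpy: %.1f cycles/copy\n",
			(double)(rte_rdtsc() - start) / iters);
}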

