> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Wednesday, 27 July 2022 19.38
> 

[...]

> >
> > > > Yes, x86 needs 16B alignment for NT load/stores But that's
> supposed
> > > to be arch
> > > > specific limitation, that we probably want to hide, no?
> >
> > Correct. However, optional hints for optimization purposes will be
> available.
> > And it is up to the architecture specific implementation to make the
> best use
> > of these hints, or just ignore them.
> >
> > > > Inside the function can check alignment of both src and dst and
> > > decide should it
> > > > use NT load/store instructions or just do normal copy.
> > > IMO, the normal copy should not be done by this API under any
> > > conditions. Why not let the application call memcpy/rte_memcpy when
> > > the NT copy is not applicable? It helps the programmer to
> understand
> > > and debug the issues much easier.
> >
> > Yes, the programmer must choose between normal memcpy() and non-
> > temporal rte_memcpy_nt(). I am offering new functions, not modifying
> > memcpy() or rte_memcpy().
> >
> > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> non-
> > temporal copying is unavailable, e.g. on POWER and RISC-V
> architectures,
> > which don't have NT load/store instructions.
> I am talking about a scenario where the application is being ported
> between architectures. Not everyone knows about the capabilities of the
> architecture. It is better to indicate upfront (ex: compilation
> failures) that a certain feature is not supported on the target
> architecture rather than the user having to discover through painful
> debugging.

I consider rte_memcpy_nt() a performance-optimized variant of memcpy(), 
where the performance gain is less cache pollution. Thus, a silent fallback to 
memcpy() should suffice.
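
To illustrate the silent fallback, here is a minimal sketch. It is not the 
proposed implementation: the name memcpy_nt_sketch() is hypothetical, it only 
uses non-temporal stores (not loads), and it assumes an x86 build with SSE2. 
On any other architecture, or when the destination is not 16-byte aligned, it 
simply calls memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper for illustration only; not the DPDK API. */
static inline void *
memcpy_nt_sketch(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	/* x86: streaming stores require a 16-byte aligned destination.
	 * This sketch also requires len to be a multiple of 16. */
	if (((uintptr_t)dst & 15) == 0 && (len & 15) == 0) {
		__m128i *d = (__m128i *)dst;
		const __m128i *s = (const __m128i *)src;
		size_t i;

		for (i = 0; i < len / 16; i++)
			_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
		/* Order the streaming stores before later stores. */
		_mm_sfence();
		return dst;
	}
#endif
	/* Silent fallback: no NT instructions, or unsuitable alignment. */
	return memcpy(dst, src, len);
}
```

The caller gets the same result either way; only the cache footprint differs.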

Other architecture differences also affect DPDK performance; the inability to 
perform non-temporal load/store is just one more item on the (undocumented) list.

Failing at build time if NT load/store is unavailable on the architecture would 
prevent the function from being used by other DPDK libraries, e.g. by the 
rte_pktmbuf_copy() function used by the pdump library.

I don't oppose your idea; I just don't have any idea how to reasonably 
implement it. So I'm trying to argue why it is not important.
