> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Wednesday, 27 July 2022 19.38
>
[...]

> > > > Yes, x86 needs 16B alignment for NT load/stores But that's supposed
> > > > to be arch specific limitation, that we probably want to hide, no?
> >
> > Correct. However, optional hints for optimization purposes will be
> > available. And it is up to the architecture specific implementation to
> > make the best use of these hints, or just ignore them.
> >
> > > > Inside the function can check alignment of both src and dst and
> > > > decide should it use NT load/store instructions or just do normal
> > > > copy.
> > > IMO, the normal copy should not be done by this API under any
> > > conditions. Why not let the application call memcpy/rte_memcpy when
> > > the NT copy is not applicable? It helps the programmer to understand
> > > and debug the issues much easier.
> >
> > Yes, the programmer must choose between normal memcpy() and
> > non-temporal rte_memcpy_nt(). I am offering new functions, not
> > modifying memcpy() or rte_memcpy().
> >
> > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> > non-temporal copying is unavailable, e.g. on POWER and RISC-V
> > architectures, which don't have NT load/store instructions.
> I am talking about a scenario where the application is being ported
> between architectures. Not everyone knows about the capabilities of the
> architecture. It is better to indicate upfront (ex: compilation
> failures) that a certain feature is not supported on the target
> architecture rather than the user having to discover through painful
> debugging.

I consider rte_memcpy_nt() a performance-optimized variant of memcpy(), where the performance gain is reduced cache pollution. Thus, a silent fallback to memcpy() should suffice. Other architecture differences also affect DPDK performance; the inability to perform non-temporal load/store is just one more item on the (undocumented) list.
Failing at build time if NT load/store is unavailable on the architecture would prevent the function from being used by other DPDK libraries, e.g. by the rte_pktmbuf_copy() function used by the pdump library.

I don't oppose your idea; I just don't have any idea how to reasonably implement it. So I'm trying to argue that it is not important.
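To make the two positions concrete, here is a rough sketch of both behaviours. It is not the proposed implementation: sketch_memcpy_nt(), SKETCH_HAVE_NT and SKETCH_REQUIRE_NT are names made up for this illustration. It assumes an x86 build with SSE2 for the non-temporal path, and for brevity it uses only non-temporal stores (not loads), and only when dst is 16-byte aligned and len is a multiple of 16. On any other target, or for any other arguments, it silently falls back to memcpy(), unless the build defines the hypothetical SKETCH_REQUIRE_NT, in which case compilation fails on targets without NT support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_NT 1
#else
#define SKETCH_HAVE_NT 0
#endif

/* Hypothetical strict mode: fail the build instead of falling back. */
#if defined(SKETCH_REQUIRE_NT) && !SKETCH_HAVE_NT
#error "non-temporal stores are not available on this target"
#endif

/* Hypothetical helper for illustration; not a DPDK function. */
static inline void *
sketch_memcpy_nt(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if SKETCH_HAVE_NT
	/* NT stores need a 16-byte aligned destination on x86. */
	if (((uintptr_t)dst & 15) == 0 && (len & 15) == 0) {
		__m128i *d = (__m128i *)dst;
		const __m128i *s = (const __m128i *)src;
		size_t i;

		for (i = 0; i < len / 16; i++)
			_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
		/* Order the NT stores before any later stores. */
		_mm_sfence();
		return dst;
	}
#endif
	/* Silent fallback: same result, just without the cache hint. */
	return memcpy(dst, src, len);
}
```

With the fallback, a library function such as rte_pktmbuf_copy() could call the helper unconditionally on every architecture. With the strict mode enabled, the same call would break the build on targets without NT support, which is the situation described above.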