> From: Scott <[email protected]>
>
> __rte_raw_cksum uses a loop with memcpy on each iteration.
> GCC 15+ is able to vectorize the loop but Clang 18.1 is not.
> Replacing the memcpy with unaligned_uint16_t pointer access enables
> both GCC and Clang to vectorize with SSE/AVX/AVX-512.
>
> This patch adds comprehensive fuzz testing and updates the performance
> test to measure the optimization impact.
>
> Performance results from cksum_perf_autotest on Intel Xeon
> (Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):
>
>   Block size    Before    After    Improvement
>          100      0.40     0.24    ~40%
>         1500      0.50     0.06    ~8x
>         9000      0.49     0.06    ~8x
>
> Signed-off-by: Scott Mitchell <[email protected]>
> ---

Probably makes no practical difference, but consider marking the
__rte_raw_cksum() function __rte_pure:
https://elixir.bootlin.com/dpdk/v25.11/source/lib/eal/include/rte_common.h#L228

With or without __rte_pure marking,
Acked-by: Morten Brørup <[email protected]>
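
For illustration only, here is a minimal standalone sketch of the two loop shapes discussed above, with the pure attribute applied. It is not the code from the patch and does not include any DPDK header: sketch_unaligned_u16 and SKETCH_PURE are hypothetical stand-ins for unaligned_uint16_t and __rte_pure, defined here with GCC/Clang attributes, and all function names are made up. Whether a given compiler vectorizes either loop still has to be checked in the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for unaligned_uint16_t: a 16-bit type with
 * alignment 1 that is allowed to alias other types (GCC/Clang only). */
typedef uint16_t sketch_unaligned_u16 __attribute__((aligned(1), may_alias));

/* Hypothetical stand-in for __rte_pure: the result depends only on the
 * arguments and on memory that the function reads, never writes. */
#define SKETCH_PURE __attribute__((pure))

/* Add the trailing odd byte, zero-padded to 16 bits. */
static uint32_t
sketch_add_tail(const void *tail, size_t len, uint32_t sum)
{
	if (len & 1) {
		uint16_t v = 0;

		memcpy(&v, tail, 1);
		sum += v;
	}
	return sum;
}

/* Variant A: one memcpy per 16-bit word. */
SKETCH_PURE static uint32_t
sketch_sum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	const uint8_t *end = p + (len & ~(size_t)1);

	for (; p != end; p += 2) {
		uint16_t v;

		memcpy(&v, p, sizeof(v));
		sum += v;
	}
	return sketch_add_tail(end, len, sum);
}

/* Variant B: direct loads through an unaligned 16-bit pointer. */
SKETCH_PURE static uint32_t
sketch_sum_unaligned(const void *buf, size_t len, uint32_t sum)
{
	const sketch_unaligned_u16 *p = buf;
	const sketch_unaligned_u16 *end = p + len / 2;

	for (; p != end; p++)
		sum += *p;
	return sketch_add_tail(end, len, sum);
}

/* Check that both variants agree on a deliberately misaligned buffer. */
static void
sketch_selfcheck(void)
{
	uint8_t data[64];
	size_t i, len;

	for (i = 0; i < sizeof(data); i++)
		data[i] = (uint8_t)(i * 37u + 11u);
	for (len = 0; len < sizeof(data); len++)
		assert(sketch_sum_memcpy(data + 1, len, 0) ==
		       sketch_sum_unaligned(data + 1, len, 0));
}
```

Both variants return the same unfolded 32-bit sum; only the way each 16-bit word is loaded differs. The pure attribute is valid for either variant because they only read the buffer, which is the property the __rte_pure suggestion relies on.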

