On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

Auto-vectorization now works on Graviton 3 (m7g.4xlarge) with GCC 11.4
after the following addition, and the results match yours. Previously,
auto-vectorization had no effect because we had omitted the -march=native
option.

    encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

Auto-vectorization gives roughly a 30% improvement over the default build.

  buf   | default | auto_vec | SVE
 -------+---------+----------+------
     16 |      16 |       12 |    8
     64 |      58 |       40 |    9
    256 |     223 |      152 |   18
   1024 |     934 |      613 |   54
   4096 |    3533 |     2430 |  202
  16384 |   14081 |     9831 |  800
  65536 |   56374 |    38702 | 3202

Auto-vectorization had no effect on hex_decode due to the presence of
control flow.

-----

Here is a comment snippet from src/include/port/simd.h:

"While Neon support is technically optional for aarch64, it appears that
all available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machines support NEON, but for
newer advanced SIMD extensions such as SVE (and AVX512 on x86) this
assumption may not hold. We need a runtime check to be sure. Using
src/include/port/simd.h to abstract away these advanced SIMD
implementations may be difficult.

We will update the thread once a solution is found.

-----
Chiranmoy
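
For illustration, the runtime check mentioned above could look roughly like
the sketch below. It is not taken from the patch. It assumes Linux on
aarch64, where the kernel reports SVE through getauxval(AT_HWCAP) and the
HWCAP_SVE bit; on any other platform it reports "no SVE". The names
sve_available, hex_encode_scalar, hex_encode_sve_placeholder,
hex_encode_choose and hex_encode_dispatch are hypothetical, and the SVE
slot only forwards to the scalar loop because no SVE kernel is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22)	/* bit assigned by the Linux arm64 hwcap ABI */
#endif
#endif

/* True only if the kernel reports SVE; always false off Linux/aarch64. */
static bool
sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return false;
#endif
}

static const char hextbl[] = "0123456789abcdef";

/* Portable fallback: two output characters per input byte. */
static size_t
hex_encode_scalar(const char *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++)
	{
		unsigned char c = (unsigned char) src[i];

		dst[2 * i] = hextbl[c >> 4];
		dst[2 * i + 1] = hextbl[c & 0xF];
	}
	return len * 2;
}

/* Hypothetical stand-in: a real SVE kernel would replace this body. */
static size_t
hex_encode_sve_placeholder(const char *src, size_t len, char *dst)
{
	return hex_encode_scalar(src, len, dst);
}

typedef size_t (*hex_encode_fn) (const char *src, size_t len, char *dst);

static size_t hex_encode_choose(const char *src, size_t len, char *dst);

/* Starts at the chooser, which replaces itself on the first call. */
static hex_encode_fn hex_encode_impl = hex_encode_choose;

static size_t
hex_encode_choose(const char *src, size_t len, char *dst)
{
	hex_encode_impl = sve_available() ?
		hex_encode_sve_placeholder : hex_encode_scalar;
	return hex_encode_impl(src, len, dst);
}

/* Callers use this entry point; the CPU check runs once, not per call. */
size_t
hex_encode_dispatch(const char *src, size_t len, char *dst)
{
	assert(src != NULL && dst != NULL);
	return hex_encode_impl(src, len, dst);
}
```

With this shape, the file holding the real SVE kernel would be the only one
built with SVE-specific flags, and the function pointer keeps the hardware
check out of the hot path.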