On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

Auto-vectorization is working now with the following addition on Graviton 3 
(m7g.4xlarge) with GCC 11.4, and the results match yours. Previously, 
auto-vectorization had no effect because we missed the -march=native option.

      encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

Auto-vectorization gives roughly a 30% improvement over the default build
for hex_encode (between 25% and 34% across the buffer sizes below).

  buf   | default | auto_vec |  SVE
--------+---------+----------+------
     16 |     16  |      12  |    8
     64 |     58  |      40  |    9
    256 |    223  |     152  |   18
   1024 |    934  |     613  |   54
   4096 |   3533  |    2430  |  202
  16384 |  14081  |    9831  |  800
  65536 |  56374  |   38702  | 3202

Auto-vectorization had no effect on hex_decode due to the presence of control 
flow.
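
To illustrate the difference, here is a simplified sketch of the two loop
shapes. This is not the code in encode.c; the sketch_* names are made up for
this mail, and the whitespace and error handling are only an approximation of
what hex_decode does. The encode loop has a trip count known on entry and a
branch-free body. The decode loop can skip input or return early, so both the
trip count and the output index depend on the data, which is the kind of
control flow that usually stops the loop vectorizer.

```c
#include <assert.h>
#include <stddef.h>

static const char sketch_hextbl[] = "0123456789abcdef";

/*
 * Encode-style loop: exactly len iterations, no branches in the body,
 * two output bytes written at a fixed offset for every input byte.
 */
size_t
sketch_hex_encode(const unsigned char *src, size_t len, char *dst)
{
	assert(src != NULL && dst != NULL);

	for (size_t i = 0; i < len; i++)
	{
		dst[2 * i] = sketch_hextbl[src[i] >> 4];
		dst[2 * i + 1] = sketch_hextbl[src[i] & 0xF];
	}
	return 2 * len;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
int
sketch_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode-style loop: whitespace is skipped and invalid input exits early,
 * so the number of iterations and the output position are data-dependent.
 * Returns the number of bytes written, or -1 on invalid input.
 */
long
sketch_hex_decode(const char *src, size_t len, unsigned char *dst)
{
	size_t		i = 0;
	size_t		n = 0;

	while (i < len)
	{
		int			hi;
		int			lo;

		if (src[i] == ' ' || src[i] == '\n' ||
			src[i] == '\t' || src[i] == '\r')
		{
			i++;				/* skip whitespace between digit pairs */
			continue;
		}
		if (i + 1 >= len)
			return -1;			/* dangling single digit */

		hi = sketch_hexval(src[i]);
		lo = sketch_hexval(src[i + 1]);
		if (hi < 0 || lo < 0)
			return -1;			/* early exit on an invalid digit */

		dst[n++] = (unsigned char) ((hi << 4) | lo);
		i += 2;
	}
	return (long) n;
}
```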

-----
Here is a comment snippet from src/include/port/simd.h

"While Neon support is technically optional for aarch64, it appears that all 
available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machines support NEON, but for
newer advanced SIMD extensions like SVE (and AVX-512 on x86) this assumption
may not hold. We need a runtime check to be sure. Using
src/include/port/simd.h to abstract away these advanced SIMD implementations
may be difficult.
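
For reference, one possible shape of such a runtime check on Linux/aarch64 is
sketched below. It is not a proposed patch. It assumes that getauxval() and
the HWCAP_SVE bit from <asm/hwcap.h> are available; other platforms would need
their own probe, and here they simply report "no SVE". All sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Signature shared by every encode implementation (hypothetical). */
typedef size_t (*sketch_encode_fn) (const unsigned char *src, size_t len,
									char *dst);

/*
 * Ask the kernel whether SVE is usable.  On anything other than
 * Linux/aarch64 this sketch conservatively answers "no".
 */
bool
sketch_sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_SVE)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return false;
#endif
}

/*
 * Choose an implementation once, e.g. on first use.  Pass NULL for sve
 * when no SVE variant was compiled in.
 */
sketch_encode_fn
sketch_choose_encode(sketch_encode_fn scalar, sketch_encode_fn sve)
{
	assert(scalar != NULL);

	if (sve != NULL && sketch_sve_available())
		return sve;
	return scalar;
}

/* Portable fallback that every build can use. */
size_t
sketch_encode_scalar(const unsigned char *src, size_t len, char *dst)
{
	static const char hextbl[] = "0123456789abcdef";

	for (size_t i = 0; i < len; i++)
	{
		dst[2 * i] = hextbl[src[i] >> 4];
		dst[2 * i + 1] = hextbl[src[i] & 0xF];
	}
	return 2 * len;
}
```

The caller would store the returned pointer and call through it afterwards,
so the hardware probe runs once and not on every call.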

We will update the thread once a solution is found.

-----
Chiranmoy
