Hello,
I implemented a new OP component for OpenMPI targeting the RISC-V vector
extension, following existing implementations for x86 (AVX) and ARM
(NEON). During testing, I aimed to reproduce results from a paper
discussing the AVX512 OP component, which stated that OpenMPI’s default
compiler did not generate auto-vectorized code
(https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf
Chapter 5, Experimental evaluation). However, on my Zen4 machine, I
observed no performance difference between the AVX OP component and the
base implementation (with `--mca op ^avx`) when running `MPI_Reduce_local`
on a 1 MB array.
To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize',
and with that build my results matched the paper’s findings. This
behavior is consistent across x86 (AVX), ARM (NEON) and RISC-V (RVV).
My question: did I overlook something in my testing or setup? Why
wouldn’t the compiler in the paper auto-vectorize the base operations
when mine apparently does unless vectorization is explicitly disabled?
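
For concreteness, here is a rough sketch of the kind of loop I mean. It is
my own simplified illustration, not code taken from OpenMPI: the name
`base_sum_int32` is made up, and I am assuming the base sum operation
boils down to an elementwise loop of roughly this shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scalar "base" sum reduction (not OpenMPI's
 * actual source). It applies inout[i] = inout[i] + in[i], which is the
 * elementwise semantics of MPI_SUM in MPI_Reduce_local. */
void base_sum_int32(const int32_t *in, int32_t *inout, size_t count)
{
    /* Precondition: both buffers must be valid unless count is zero. */
    assert(count == 0 || (in != NULL && inout != NULL));

    /* A plain counted loop with no cross-iteration dependency; this is
     * the pattern a compiler's auto-vectorizer is meant to recognize. */
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}
```

Compiling a file like this with `gcc -O3 -S` and again with
`gcc -O3 -fno-tree-vectorize -S`, then comparing the generated assembly,
should show whether the loop is turned into SIMD instructions on a given
toolchain.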
Thank you!
Marco