Hello,

I implemented a new OP component for OpenMPI targeting the RISC-V vector extension (RVV), following the existing implementations for x86 (AVX) and ARM (NEON). During testing, I tried to reproduce results from a paper on the AVX512 OP component, which states that the compiler used for OpenMPI's default build did not generate auto-vectorized code (https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf, Chapter 5, Experimental evaluation).

However, on my Zen4 machine I observed no performance difference between the AVX OP component and the base implementation (selected with --mca op ^avx) when running `MPI_Reduce_local` on a 1 MB array. To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize'. With auto-vectorization disabled, my results matched the paper's findings. This behavior is consistent across x86 (AVX), ARM (NEON) and RISC-V (RVV).

My question: did I overlook something in my testing or setup? Why would the compiler in the paper not auto-vectorize the base operations, when mine apparently does unless I explicitly disable it?
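
For reference, here is a minimal sketch of the kind of element-wise reduction loop I mean. It is not the actual OpenMPI base op source; the function name `sum_int32` and the `restrict` qualifiers are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a base "sum" reduction kernel (not OpenMPI code):
 * accumulates in[i] into inout[i] for count elements.
 * The restrict qualifiers are an assumption; they tell the compiler the two
 * buffers do not overlap, which makes the loop an easy vectorization target. */
void sum_int32(const int32_t *restrict in, int32_t *restrict inout, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}
```

Compiling this file with `gcc -O3 -fopt-info-vec` and then with `gcc -O3 -fno-tree-vectorize -fopt-info-vec` should show whether a given GCC version vectorizes the loop; the report appears only when the vectorizer actually transforms it.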

Thank you!

Marco
