Sorry for the late answer. Most of the things above are correct: when building for a specific architecture the compiler does wonders, give or take a few years. But as Gilles pointed out, we are seeking the best performance across different families of processors, so we helped the compiler a little.
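To make that concrete, here is a minimal sketch of the "help the compiler" idea. This is not the actual Open MPI op component code; the function names are made up, and the GCC/Clang-specific target attribute and __builtin_cpu_supports() are just one way to express per-function ISA targeting and runtime dispatch, in the spirit of what the AVX component does.

/* Illustrative sketch only -- not the actual Open MPI op component code.
 * Ship a portable base loop (which the compiler may auto-vectorize if the
 * build flags allow it) plus a hand-written intrinsic kernel, and pick
 * between them at runtime based on what the CPU actually supports. */
#include <immintrin.h>
#include <stddef.h>

/* Portable base version: a plain loop the compiler is free to vectorize. */
static void sum_float_base(const float *in, float *inout, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}

/* Hand-written AVX-512 version; the target attribute lets this one function
 * use AVX-512 even if the rest of the library targets a baseline ISA. */
__attribute__((target("avx512f")))
static void sum_float_avx512(const float *in, float *inout, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 a = _mm512_loadu_ps(in + i);
        __m512 b = _mm512_loadu_ps(inout + i);
        _mm512_storeu_ps(inout + i, _mm512_add_ps(a, b));
    }
    for (; i < n; i++)              /* scalar tail */
        inout[i] += in[i];
}

/* Runtime dispatch: use the intrinsic version only on CPUs that support it. */
void sum_float(const float *in, float *inout, size_t n)
{
    if (__builtin_cpu_supports("avx512f"))
        sum_float_avx512(in, inout, n);
    else
        sum_float_base(in, inout, n);
}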
If I understand correctly it works on x86, but somehow we screwed up the ARM part by not checking different sets of flags. One thing to notice is that the paper mentioned here was from 2020 (which means the experiments were certainly done in the 2019 timeframe), when few ARM processors were available and the distros were distributing binaries compiled with a better-optimized set of flags. That has certainly changed, which means the configure.m4 for the sve op needs a well-deserved update to mimic the x86 one and provide a base version, a NEON version, and then maybe a few versions of SVE (depending on the vector length).

Any contribution would be more than welcome; if you provide a patch I will certainly be happy to review it.

Best,
George.

On Fri, Jan 31, 2025 at 2:17 PM Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Marco,
>
> These are some fair points, and I guess George (who initially authored this module, iirc) will soon shed some light.
>
> Cheers,
>
> Gilles
>
> On Fri, Jan 31, 2025 at 9:08 PM Marco Vogel <marco...@hotmail.de> wrote:
>
>> Gilles,
>>
>> Thank you for your response. I understand that distro-provided OpenMPI binaries are typically built for broad compatibility, often targeting only baseline instruction sets.
>>
>> For x86, this makes sense: if OpenMPI is compiled with a target instruction set like `x86-64-v2` (no AVX), the `configure.m4` script for the AVX component first attempts to compile AVX code directly. If that fails, it retries with the necessary vectorization flags (e.g., `-mavx512f`). If successful, these flags are applied, ensuring that the vectorized functions are included in the AVX component. At runtime, OpenMPI detects CPU capabilities (via CPUID) and uses the AVX functions when available, even if vectorization wasn't explicitly enabled by the package maintainers - assuming I correctly understood the compilation process of the OP components.
>>
>> What I find unclear is why the AArch64 component follows a different approach. During configuration, it only checks whether the compiler can compile NEON or SVE without additional flags. If not, the corresponding intrinsic functions are omitted entirely. This means that if the distro compilation settings don't allow NEON or SVE, OpenMPI won't include the optimized functions, and processors with these vector units won't benefit. Conversely, if NEON or SVE is allowed, the base OPs will likely be auto-vectorized, reducing the performance gap between the base and intrinsic implementations.
>>
>> Is there a specific reason for this difference in handling SIMD support between x86 and AArch64 in OpenMPI, or am I wrong about the configuration process?
>>
>> Cheers,
>>
>> Marco
>>
>> On 31.01.25 11:47, Gilles Gouaillardet wrote:
>>
>> Marco,
>>
>> The compiler may vectorize if generating code optimised for a given platform.
>> A distro-provided Open MPI is likely to be optimised only for "common" architectures (e.g. no AVX512 on x86 - SSE only? - and no SVE on aarch64).
>>
>> Cheers,
>>
>> Gilles
>>
>> On Fri, Jan 31, 2025, 18:06 Marco Vogel <marco...@hotmail.de> wrote:
>>
>>> Hello,
>>>
>>> I implemented a new OP component for OpenMPI targeting the RISC-V vector extension, following the existing implementations for x86 (AVX) and ARM (NEON).
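For reference, this is roughly what such a hand-written kernel looks like on ARM. It is only an illustrative sketch with made-up names, not the actual code of Open MPI's AArch64 op component: a float-sum loop written with NEON intrinsics, with a scalar tail for the remainder.

/* Illustrative only -- not the actual ompi AArch64 op component code. */
#include <arm_neon.h>
#include <stddef.h>

static void sum_float_neon(const float *in, float *inout, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t a = vld1q_f32(in + i);      /* load 4 floats from each buffer */
        float32x4_t b = vld1q_f32(inout + i);
        vst1q_f32(inout + i, vaddq_f32(a, b));  /* inout[i..i+3] += in[i..i+3] */
    }
    for (; i < n; i++)                          /* scalar tail */
        inout[i] += in[i];
}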
>>> During testing, I aimed to reproduce results from a paper discussing the AVX512 OP component, which stated that OpenMPI's default compiler did not generate auto-vectorized code (https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf, Chapter 5, Experimental evaluation). However, on my Zen4 machine, I observed no performance difference between the AVX OP component and the base implementation (with --mca op ^avx) when running `MPI_Reduce_local` on a 1MB array.
>>>
>>> To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize', which then confirmed the paper's findings. This behavior is consistent across x86 (AVX), ARM (NEON) and RISC-V (RVV). My question: did I overlook something in my testing or setup? Why wouldn't the compiler in the paper auto-vectorize the base operations when mine apparently does unless explicitly disabled?
>>>
>>> Thank you!
>>>
>>> Marco

To unsubscribe from this group and stop receiving emails from it, send an email to devel+unsubscr...@lists.open-mpi.org.
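For anyone who wants to reproduce the comparison described in the thread, here is a minimal sketch of such a microbenchmark. The 1 MB buffer size comes from the thread; the iteration count, initialization, and timing details are my own choices, and the program name is made up.

/* Sketch of a microbenchmark: time MPI_Reduce_local on a 1 MB float buffer. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int count = 1 << 18;             /* 262144 floats = 1 MB */
    const int iters = 1000;
    float *in    = malloc(count * sizeof(float));
    float *inout = malloc(count * sizeof(float));
    for (int i = 0; i < count; i++) { in[i] = 1.0f; inout[i] = 2.0f; }

    /* warm-up call so the first timed iteration is not an outlier */
    MPI_Reduce_local(in, inout, count, MPI_FLOAT, MPI_SUM);

    double t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++)
        MPI_Reduce_local(in, inout, count, MPI_FLOAT, MPI_SUM);
    double t1 = MPI_Wtime();

    printf("MPI_Reduce_local: %.3f us per call\n", 1e6 * (t1 - t0) / iters);

    free(in);
    free(inout);
    MPI_Finalize();
    return 0;
}

Built with mpicc, it can be run as `mpirun -np 1 ./reduce_local` with the default ops and as `mpirun -np 1 --mca op ^avx ./reduce_local` to disable the AVX component, which is the comparison Marco describes.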