Gilles,
Thank you for your response. I understand that distro-provided OpenMPI
binaries are typically built for broad compatibility, often targeting
only baseline instruction sets.
For x86, this makes sense: if OpenMPI is compiled for a baseline target
such as `x86-64-v2` (no AVX), the AVX component's `configure.m4` first
attempts to compile AVX code directly. If that fails, it retries with the
necessary vectorization flags (e.g., `-mavx512f`). If that succeeds, those
flags are applied to the component, so the vectorized functions are still
built into it. At runtime, OpenMPI detects the CPU's capabilities (via
CPUID) and uses the AVX functions when available, even if vectorization
wasn't explicitly enabled by the package maintainers, assuming I have
understood the build process of the OP components correctly.
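To make sure we are talking about the same pattern, here is a minimal,
self-contained C sketch of it. This is not Open MPI's actual code (the real
component gets its per-file flags from configure; the function names and the
use of the target attribute here are my own, just to keep the sketch in one
file), but it shows how a baseline build can still dispatch to an AVX-512
kernel at runtime via a CPUID-backed check:

    /* sketch only: runtime dispatch between a baseline and an AVX-512 kernel */
    #include <immintrin.h>
    #include <stdio.h>

    #define N 1024

    /* portable baseline loop */
    static void sum_base(const float *in, float *inout, int n)
    {
        for (int i = 0; i < n; i++)
            inout[i] += in[i];
    }

    /* AVX-512 kernel; the target attribute lets this compile even when the
     * translation unit itself is built for a baseline ISA */
    __attribute__((target("avx512f")))
    static void sum_avx512(const float *in, float *inout, int n)
    {
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 a = _mm512_loadu_ps(in + i);
            __m512 b = _mm512_loadu_ps(inout + i);
            _mm512_storeu_ps(inout + i, _mm512_add_ps(a, b));
        }
        for (; i < n; i++)
            inout[i] += in[i];
    }

    int main(void)
    {
        static float in[N], inout[N];
        for (int i = 0; i < N; i++) { in[i] = 1.0f; inout[i] = 2.0f; }

        /* GCC/Clang builtin backed by CPUID: only use the AVX-512 kernel
         * if the machine we run on actually has it */
        if (__builtin_cpu_supports("avx512f"))
            sum_avx512(in, inout, N);
        else
            sum_base(in, inout, N);

        printf("inout[0] = %f\n", inout[0]);
        return 0;
    }

This builds with plain `gcc -O2` on x86-64, without `-mavx512f` on the
command line, which is essentially the property the AVX component relies on.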
What I find unclear is why the AArch64 component follows a different
approach. During configuration, it only checks whether the compiler can
compile NEON or SVE code without additional flags. If it cannot, the
corresponding intrinsic functions are omitted entirely. This means that if
the distro's compilation settings don't enable NEON or SVE, OpenMPI won't
include the optimized functions, and processors with these vector units
won't benefit. Conversely, if NEON or SVE is enabled, the base OPs will
likely be auto-vectorized anyway, reducing the performance gap between the
base and intrinsic implementations.
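In other words, the AArch64 decision stands or falls with a probe of roughly
this shape, compiled without any extra flags (this is my own illustration,
not the literal conftest body from configure.m4; NEON is baseline on
AArch64, so SVE is the interesting case):

    /* sketch of an SVE compile probe: if the compiler's default target
     * doesn't enable SVE (e.g. no -march=armv8-a+sve), this fails to
     * compile and the intrinsic kernels would be dropped */
    #include <arm_sve.h>

    int main(void)
    {
        svbool_t    pg = svptrue_b32();
        svfloat32_t a  = svdup_f32(1.0f);
        svfloat32_t b  = svdup_f32(2.0f);
        svfloat32_t c  = svadd_f32_m(pg, a, b);
        return (int)svlastb_f32(pg, c) == 3 ? 0 : 1;
    }

Unlike on x86, there is no fallback step that retries this probe with the
required `-march` flags and no runtime check that would gate the kernels on
the actual hardware.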
Is there a specific reason for this difference in handling SIMD support
between x86 and AArch64 in OpenMPI, or am I misunderstanding the
configuration process?
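For reference, the measurement I described in my original message (quoted
below) was essentially a timing loop of this shape; the buffer contents,
iteration count and timing code here are illustrative rather than my exact
harness:

    /* time MPI_Reduce_local over a 1MB float buffer; run once normally
     * and once with "mpirun --mca op ^avx" to exclude the AVX component */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = (1 << 20) / sizeof(float);   /* 1MB of floats */
        float *in    = malloc(count * sizeof(float));
        float *inout = malloc(count * sizeof(float));
        for (int i = 0; i < count; i++) { in[i] = 1.0f; inout[i] = 2.0f; }

        const int iters = 1000;
        double t0 = MPI_Wtime();
        for (int it = 0; it < iters; it++)
            MPI_Reduce_local(in, inout, count, MPI_FLOAT, MPI_SUM);
        double t1 = MPI_Wtime();

        printf("MPI_Reduce_local: %.3f us/iter\n", (t1 - t0) / iters * 1e6);

        free(in);
        free(inout);
        MPI_Finalize();
        return 0;
    }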
Cheers,
Marco
On 31.01.25 11:47, Gilles Gouaillardet wrote:
Marco,
The compiler may vectorize if generating code optimised for a given
platform.
A distro provided Open MPI is likely to be optimised only for "common"
architectures (e.g. no AVX512 on x86 - SSE only? - and no SVE on aarch64)
Cheers,
Gilles
On Fri, Jan 31, 2025, 18:06 Marco Vogel <marco...@hotmail.de> wrote:
Hello,

I implemented a new OP component for OpenMPI targeting the RISC-V vector
extension, following existing implementations for x86 (AVX) and ARM (NEON).
During testing, I aimed to reproduce results from a paper discussing the
AVX512 OP component, which stated that OpenMPI's default compiler did not
generate auto-vectorized code
(https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf,
Chapter 5, Experimental evaluation). However, on my Zen4 machine, I observed
no performance difference between the AVX OP component and the base
implementation (with --mca op ^avx) when running `MPI_Reduce_local` on a
1MB array.

To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize',
which then confirmed the paper's findings. This behavior is consistent
across x86 (AVX), ARM (NEON) and RISC-V (RVV). My question: Did I overlook
something in my testing or setup? Why wouldn't the compiler in the paper
auto-vectorize the base operations when mine allegedly does unless
explicitly disabled?

Thank you!
Marco