Gilles,

Thank you for your response. I understand that distro-provided OpenMPI binaries are typically built for broad compatibility, often targeting only baseline instruction sets.

For x86, this makes sense: even if OpenMPI is compiled for a baseline target such as `x86-64-v2` (no AVX), the `configure.m4` script of the AVX component first tries to compile the AVX code as-is and, if that fails, retries with the necessary vectorization flags (e.g. `-mavx512f`). If one of the attempts succeeds, those flags are applied to the component, so the vectorized functions are still built into it. At runtime, OpenMPI then detects the CPU capabilities (via CPUID) and calls the AVX functions when the hardware supports them, even though the package maintainers never explicitly enabled vectorization. At least that is how I understood the compilation process of the OP components.
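
To make concrete what I mean (a minimal sketch under my assumptions, not the actual op/avx sources, with made-up names like sum_i32): a per-function target attribute stands in for the per-file flags that configure.m4 adds, and GCC's __builtin_cpu_supports stands in for the CPUID-based dispatch, so the AVX-512 path gets compiled even under an x86-64-v2 baseline and is only selected at run time:

/* Sketch only: the vectorized kernel is built via a target attribute even
 * though the global baseline lacks AVX-512; the wrapper picks it at runtime. */
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

__attribute__((target("avx512f")))
static void sum_i32_avx512(int32_t *inout, const int32_t *in, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i a = _mm512_loadu_si512((const void *)(in + i));
        __m512i b = _mm512_loadu_si512((const void *)(inout + i));
        _mm512_storeu_si512((void *)(inout + i), _mm512_add_epi32(a, b));
    }
    for (; i < n; i++)
        inout[i] += in[i];
}

static void sum_i32_base(int32_t *inout, const int32_t *in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}

void sum_i32(int32_t *inout, const int32_t *in, size_t n)
{
    /* Runtime capability check; Open MPI uses its own CPU feature checks,
     * this GCC builtin is just the simplest stand-in for the example. */
    if (__builtin_cpu_supports("avx512f"))
        sum_i32_avx512(inout, in, n);
    else
        sum_i32_base(inout, in, n);
}

(As far as I can tell, the real component achieves the same effect by compiling whole source files with the extra flags rather than per-function attributes, but the compile-time/run-time split is the same.)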

What I find unclear is why the AArch64 component follows a different approach. During configuration, it only checks whether the compiler can compile NEON or SVE code without additional flags; if it cannot, the corresponding intrinsic functions are omitted entirely. So if the distro's compilation settings don't enable NEON or SVE, OpenMPI won't include the optimized functions, and processors with these vector units won't benefit from them. Conversely, if NEON or SVE is enabled, the base OPs will likely be auto-vectorized anyway, shrinking the performance gap between the base and intrinsic implementations.
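
For contrast, here is the AArch64 behaviour as I understand it, as an equally rough sketch (again hypothetical names, not the actual op/aarch64 sources): the NEON path exists only if the baseline flags already define __ARM_NEON; there is no retry with extra flags and no runtime selection, so the choice is fixed at build time:

/* Sketch only: if the default flags don't advertise NEON, only the scalar
 * base op is built, no matter what the CPU at runtime could do. */
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
static void sum_i32_neon(int32_t *inout, const int32_t *in, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int32x4_t a = vld1q_s32(in + i);
        int32x4_t b = vld1q_s32(inout + i);
        vst1q_s32(inout + i, vaddq_s32(a, b));
    }
    for (; i < n; i++)
        inout[i] += in[i];
}
#endif

void sum_i32(int32_t *inout, const int32_t *in, size_t n)
{
#if defined(__ARM_NEON)
    sum_i32_neon(inout, in, n);    /* only present if the baseline enabled NEON */
#else
    for (size_t i = 0; i < n; i++) /* otherwise the scalar base op is all there is */
        inout[i] += in[i];
#endif
}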

Is there a specific reason for this difference in how SIMD support is handled between x86 and AArch64 in OpenMPI, or am I wrong about the configuration process?

Cheers,

Marco

On 31.01.25 11:47, Gilles Gouaillardet wrote:

Marco,

The compiler may vectorize if generating code optimised for a given platform. A distro provided Open MPI is likely to be optimised only for "common" architectures (e.g. no AVX512 on x86 - SSE only? - and no SVE on aarch64)

Cheers,

Gilles


On Fri, Jan 31, 2025, 18:06 Marco Vogel <marco...@hotmail.de> wrote:

    Hello,

    I implemented a new OP component for OpenMPI targeting the RISC-V vector extension, following existing implementations for x86 (AVX) and ARM (NEON). During testing, I aimed to reproduce results from a paper discussing the AVX512 OP component, which stated that OpenMPI’s default compiler did not generate auto-vectorized code (https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf, Chapter 5, Experimental evaluation). However, on my Zen4 machine, I observed no performance difference between the AVX OP component and the base implementation (with --mca op ^avx) when running `MPI_Reduce_local` on a 1MB array.

    To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize', which then confirmed the paper’s findings. This behavior is consistent across x86 (AVX), ARM (NEON) and RISC-V (RVV). My question: Did I overlook something in my testing or setup? Why wouldn’t the compiler in the paper auto-vectorize the base operations when mine allegedly does unless explicitly disabled?

    Thank you!

    Marco
