Gilles,
Thank you for your response. I understand that distro-provided OpenMPI
binaries are typically built for broad compatibility, often targeting
only baseline instruction sets.
For x86, this makes sense: if OpenMPI is compiled for a baseline target
such as `x86-64-v2` (no AVX), the AVX component's `configure.m4` first
attempts to compile AVX code directly. If that fails, it retries with the
necessary vectorization flags (e.g., `-mavx512f`). If that succeeds, those
flags are applied to the component, so the vectorized functions are still
built into it. At runtime, OpenMPI detects the CPU's capabilities (via
CPUID) and uses the AVX functions when available, even if vectorization
wasn't explicitly enabled by the package maintainers, assuming I have
understood the build process of the OP components correctly.
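To make sure we are talking about the same pattern, here is a minimal,
self-contained C sketch of it. This is not Open MPI's actual code (the real
component gets its per-file flags from configure; the function names and the
use of the target attribute here are my own, just to keep the sketch in one
file), but it shows how a baseline build can still dispatch to an AVX-512
kernel at runtime via a CPUID-backed check:

    /* sketch only: runtime dispatch between a baseline and an AVX-512 kernel */
    #include <immintrin.h>
    #include <stdio.h>

    #define N 1024

    /* portable baseline loop */
    static void sum_base(const float *in, float *inout, int n)
    {
        for (int i = 0; i < n; i++)
            inout[i] += in[i];
    }

    /* AVX-512 kernel; the target attribute lets this compile even when the
     * translation unit itself is built for a baseline ISA */
    __attribute__((target("avx512f")))
    static void sum_avx512(const float *in, float *inout, int n)
    {
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 a = _mm512_loadu_ps(in + i);
            __m512 b = _mm512_loadu_ps(inout + i);
            _mm512_storeu_ps(inout + i, _mm512_add_ps(a, b));
        }
        for (; i < n; i++)
            inout[i] += in[i];
    }

    int main(void)
    {
        static float in[N], inout[N];
        for (int i = 0; i < N; i++) { in[i] = 1.0f; inout[i] = 2.0f; }

        /* GCC/Clang builtin backed by CPUID: only use the AVX-512 kernel
         * if the machine we run on actually has it */
        if (__builtin_cpu_supports("avx512f"))
            sum_avx512(in, inout, N);
        else
            sum_base(in, inout, N);

        printf("inout[0] = %f\n", inout[0]);
        return 0;
    }

This builds with plain `gcc -O2` on x86-64, without `-mavx512f` on the
command line, which is essentially the property the AVX component relies on.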
What I find unclear is why the AArch64 component follows a different
approach. During configuration, it only checks whether the compiler can
compile NEON or SVE code without additional flags. If it cannot, the
corresponding intrinsic functions are omitted entirely. This means that if
the distro's compilation settings don't enable NEON or SVE, OpenMPI won't
include the optimized functions, and processors with these vector units
won't benefit. Conversely, if NEON or SVE is enabled, the base OPs will
likely be auto-vectorized anyway, reducing the performance gap between the
base and intrinsic implementations.
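In other words, the AArch64 decision stands or falls with a probe of roughly
this shape, compiled without any extra flags (this is my own illustration,
not the literal conftest body from configure.m4; NEON is baseline on
AArch64, so SVE is the interesting case):

    /* sketch of an SVE compile probe: if the compiler's default target
     * doesn't enable SVE (e.g. no -march=armv8-a+sve), this fails to
     * compile and the intrinsic kernels would be dropped */
    #include <arm_sve.h>

    int main(void)
    {
        svbool_t    pg = svptrue_b32();
        svfloat32_t a  = svdup_f32(1.0f);
        svfloat32_t b  = svdup_f32(2.0f);
        svfloat32_t c  = svadd_f32_m(pg, a, b);
        return (int)svlastb_f32(pg, c) == 3 ? 0 : 1;
    }

Unlike on x86, there is no fallback step that retries this probe with the
required `-march` flags and no runtime check that would gate the kernels on
the actual hardware.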
Is there a specific reason for this difference in handling SIMD support
between x86 and AArch64 in OpenMPI, or am I misunderstanding the
configuration process?
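For reference, the measurement I described in my original message (quoted
below) was essentially a timing loop of this shape; the buffer contents,
iteration count and timing code here are illustrative rather than my exact
harness:

    /* time MPI_Reduce_local over a 1MB float buffer; run once normally
     * and once with "mpirun --mca op ^avx" to exclude the AVX component */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = (1 << 20) / sizeof(float);   /* 1MB of floats */
        float *in    = malloc(count * sizeof(float));
        float *inout = malloc(count * sizeof(float));
        for (int i = 0; i < count; i++) { in[i] = 1.0f; inout[i] = 2.0f; }

        const int iters = 1000;
        double t0 = MPI_Wtime();
        for (int it = 0; it < iters; it++)
            MPI_Reduce_local(in, inout, count, MPI_FLOAT, MPI_SUM);
        double t1 = MPI_Wtime();

        printf("MPI_Reduce_local: %.3f us/iter\n", (t1 - t0) / iters * 1e6);

        free(in);
        free(inout);
        MPI_Finalize();
        return 0;
    }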
Cheers,
Marco
On 31.01.25 11:47, Gilles Gouaillardet wrote:
Marco,
The compiler may vectorize if generating code optimised for a given
platform.
A distro provided Open MPI is likely to be optimised only for "common"
architectures (e.g. no AVX512 on x86 - SSE only? - and no SVE on aarch64)
Cheers,
Gilles
On Fri, Jan 31, 2025, 18:06 Marco Vogel <marco...@hotmail.de> wrote:
Hello,

I implemented a new OP component for OpenMPI targeting the RISC-V vector
extension, following existing implementations for x86 (AVX) and ARM (NEON).
During testing, I aimed to reproduce results from a paper discussing the
AVX512 OP component, which stated that OpenMPI's default compiler did not
generate auto-vectorized code
(https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf,
Chapter 5, Experimental evaluation). However, on my Zen4 machine, I observed
no performance difference between the AVX OP component and the base
implementation (with --mca op ^avx) when running `MPI_Reduce_local` on a
1MB array.

To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize',
which then confirmed the paper's findings. This behavior is consistent
across x86 (AVX), ARM (NEON) and RISC-V (RVV). My question: Did I overlook
something in my testing or setup? Why wouldn't the compiler in the paper
auto-vectorize the base operations when mine allegedly does unless
explicitly disabled?

Thank you!
Marco