Sorry for the late answer. Most of the things above are correct: when building for a specific architecture the compiler does wonders, give or take a few years. But as Gilles pointed out, we are seeking the best performance across different families of processors, so we helped the compiler a little.
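To make that concrete, here is a minimal sketch of the "help the compiler" idea. This is not the actual Open MPI op component code; the function names are made up, and the GCC/Clang-specific target attribute and __builtin_cpu_supports() are just one way to express per-function ISA targeting and runtime dispatch, in the spirit of what the AVX component does.

/* Illustrative sketch only -- not the actual Open MPI op component code.
 * Ship a portable base loop (which the compiler may auto-vectorize if the
 * build flags allow it) plus a hand-written intrinsic kernel, and pick
 * between them at runtime based on what the CPU actually supports. */
#include <immintrin.h>
#include <stddef.h>

/* Portable base version: a plain loop the compiler is free to vectorize. */
static void sum_float_base(const float *in, float *inout, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}

/* Hand-written AVX-512 version; the target attribute lets this one function
 * use AVX-512 even if the rest of the library targets a baseline ISA. */
__attribute__((target("avx512f")))
static void sum_float_avx512(const float *in, float *inout, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 a = _mm512_loadu_ps(in + i);
        __m512 b = _mm512_loadu_ps(inout + i);
        _mm512_storeu_ps(inout + i, _mm512_add_ps(a, b));
    }
    for (; i < n; i++)              /* scalar tail */
        inout[i] += in[i];
}

/* Runtime dispatch: use the intrinsic version only on CPUs that support it. */
void sum_float(const float *in, float *inout, size_t n)
{
    if (__builtin_cpu_supports("avx512f"))
        sum_float_avx512(in, inout, n);
    else
        sum_float_base(in, inout, n);
}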
If I understand correctly it works on x86, but somehow we screwed up the ARM part by not checking different sets of flags. One thing to notice is that the paper mentioned here was from 2020 (which means the experiments were certainly done in the 2019 timeframe), when few ARM processors were available and the distros were distributing binaries compiled with a better-optimized set of flags. That has certainly changed, which means the configure.m4 for the sve op needs a well-deserved update to mimic the x86 one and provide a base version, a NEON version, and then maybe a few versions of SVE (depending on the vector length).

Any contribution would be more than welcome; if you provide a patch I will certainly be happy to review it.

Best,
George.

On Fri, Jan 31, 2025 at 2:17 PM Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Marco,
>
> These are some fair points, and I guess George (who initially authored this module, iirc) will soon shed some light.
>
> Cheers,
>
> Gilles
>
> On Fri, Jan 31, 2025 at 9:08 PM Marco Vogel <marco...@hotmail.de> wrote:
>
>> Gilles,
>>
>> Thank you for your response. I understand that distro-provided OpenMPI binaries are typically built for broad compatibility, often targeting only baseline instruction sets.
>>
>> For x86, this makes sense: if OpenMPI is compiled with a target instruction set like `x86-64-v2` (no AVX), the `configure.m4` script for the AVX component first attempts to compile AVX code directly. If that fails, it retries with the necessary vectorization flags (e.g., `-mavx512f`). If successful, these flags are applied, ensuring that the vectorized functions are included in the AVX component. At runtime, OpenMPI detects CPU capabilities (via CPUID) and uses the AVX functions when available, even if vectorization wasn't explicitly enabled by the package maintainers - assuming I correctly understood the compilation process of the OP components.
>>
>> What I find unclear is why the AArch64 component follows a different approach. During configuration, it only checks whether the compiler can compile NEON or SVE without additional flags. If not, the corresponding intrinsic functions are omitted entirely. This means that if the distro compilation settings don't allow NEON or SVE, OpenMPI won't include the optimized functions, and processors with these vector units won't benefit. Conversely, if NEON or SVE is allowed, the base OPs will likely be auto-vectorized, reducing the performance gap between the base and intrinsic implementations.
>>
>> Is there a specific reason for this difference in handling SIMD support between x86 and AArch64 in OpenMPI, or am I wrong about the configuration process?
>>
>> Cheers,
>>
>> Marco
>>
>> On 31.01.25 11:47, Gilles Gouaillardet wrote:
>>
>> Marco,
>>
>> The compiler may vectorize if generating code optimised for a given platform.
>> A distro-provided Open MPI is likely to be optimised only for "common" architectures (e.g. no AVX512 on x86 - SSE only? - and no SVE on aarch64).
>>
>> Cheers,
>>
>> Gilles
>>
>> On Fri, Jan 31, 2025, 18:06 Marco Vogel <marco...@hotmail.de> wrote:
>>
>>> Hello,
>>>
>>> I implemented a new OP component for OpenMPI targeting the RISC-V vector extension, following the existing implementations for x86 (AVX) and ARM (NEON).
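For reference, this is roughly what such a hand-written kernel looks like on ARM. It is only an illustrative sketch with made-up names, not the actual code of Open MPI's AArch64 op component: a float-sum loop written with NEON intrinsics, with a scalar tail for the remainder.

/* Illustrative only -- not the actual ompi AArch64 op component code. */
#include <arm_neon.h>
#include <stddef.h>

static void sum_float_neon(const float *in, float *inout, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t a = vld1q_f32(in + i);      /* load 4 floats from each buffer */
        float32x4_t b = vld1q_f32(inout + i);
        vst1q_f32(inout + i, vaddq_f32(a, b));  /* inout[i..i+3] += in[i..i+3] */
    }
    for (; i < n; i++)                          /* scalar tail */
        inout[i] += in[i];
}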
>>> During testing, I aimed to reproduce results from a paper discussing the AVX512 OP component, which stated that OpenMPI's default compiler did not generate auto-vectorized code (https://icl.utk.edu/files/publications/2020/icl-utk-1416-2020.pdf, Chapter 5, Experimental evaluation). However, on my Zen4 machine, I observed no performance difference between the AVX OP component and the base implementation (with --mca op ^avx) when running `MPI_Reduce_local` on a 1MB array.
>>>
>>> To investigate, I rebuilt OpenMPI with CFLAGS='-O3 -fno-tree-vectorize', which then confirmed the paper's findings. This behavior is consistent across x86 (AVX), ARM (NEON) and RISC-V (RVV). My question: did I overlook something in my testing or setup? Why wouldn't the compiler in the paper auto-vectorize the base operations when mine apparently does unless explicitly disabled?
>>>
>>> Thank you!
>>>
>>> Marco

To unsubscribe from this group and stop receiving emails from it, send an email to devel+unsubscr...@lists.open-mpi.org.
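For anyone who wants to reproduce the comparison described in the thread, here is a minimal sketch of such a microbenchmark. The 1 MB buffer size comes from the thread; the iteration count, initialization, and timing details are my own choices, and the program name is made up.

/* Sketch of a microbenchmark: time MPI_Reduce_local on a 1 MB float buffer. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int count = 1 << 18;             /* 262144 floats = 1 MB */
    const int iters = 1000;
    float *in    = malloc(count * sizeof(float));
    float *inout = malloc(count * sizeof(float));
    for (int i = 0; i < count; i++) { in[i] = 1.0f; inout[i] = 2.0f; }

    /* warm-up call so the first timed iteration is not an outlier */
    MPI_Reduce_local(in, inout, count, MPI_FLOAT, MPI_SUM);

    double t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++)
        MPI_Reduce_local(in, inout, count, MPI_FLOAT, MPI_SUM);
    double t1 = MPI_Wtime();

    printf("MPI_Reduce_local: %.3f us per call\n", 1e6 * (t1 - t0) / iters);

    free(in);
    free(inout);
    MPI_Finalize();
    return 0;
}

Built with mpicc, it can be run as `mpirun -np 1 ./reduce_local` with the default ops and as `mpirun -np 1 --mca op ^avx ./reduce_local` to disable the AVX component, which is the comparison Marco describes.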