FMA vectorization: vfmaddsub132pd used with incorrect multiplier

liuxin24 at iscas dot ac.cn via Gcc-bugs Sun, 07 Jun 2026 18:33:28 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125654


            Bug ID: 125654
           Summary: [15 regression] Wrong code with AVX2/FMA
                    vectorization: vfmaddsub132pd used with incorrect
                    multiplier
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: liuxin24 at iscas dot ac.cn
  Target Milestone: ---

Created attachment 64665
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=64665&action=edit
a minimal case

GCC 15.2.0 miscompiles a small C++ test case when compiled with
-O3 -march=native on an x86_64 target with AVX2/FMA support. The
vectorizer combines symmetric "center +/- half_range" computations
for two axes and emits a vfmaddsub132pd instruction, but passes the
register holding the packed [center, half_range] values as the FMA
multiplier instead of a vector of 1.0s. Each lane is therefore
squared before the add/subtract, and the numerical results are
completely wrong.
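
For readers without the attachment, the pattern can be sketched as
follows. This is a hypothetical reconstruction inferred from the
assembly comments later in this report, not the contents of
repro_minimal.ii: the member names, the 0.5 factor, and the input
values {0, 10, 5, 15} are assumptions chosen to be consistent with
the expected output.

```cpp
#include <cassert>

// Hypothetical stand-in for the Box type in the attachment.
// Field order mirrors the offsets 0, 8, 16, 24 seen in the listing.
struct Box {
    double uMin, uMax, vMin, vMax;

    // Widen each axis around its center by a factor of 1.5
    // (the constant the report's listing loads from .LC0).
    void expandRange() {
        const double scale = 1.5;
        const double uCenter = (uMax + uMin) * 0.5;
        const double uHalf = (uMax - uMin) * scale * 0.5;
        const double vCenter = (vMax + vMin) * 0.5;
        const double vHalf = (vMax - vMin) * scale * 0.5;
        // The symmetric "center +/- half_range" stores that the
        // vectorizer combines into one add/subtract operation.
        uMin = uCenter - uHalf;
        uMax = uCenter + uHalf;
        vMin = vCenter - vHalf;
        vMax = vCenter + vHalf;
    }
};

// Assumed input, picked so that a correct compile yields the
// expected result quoted in this report.
Box expanded_example() {
    Box b{0.0, 10.0, 5.0, 15.0};
    assert(b.uMin <= b.uMax && b.vMin <= b.vMax);  // sanity check on the made-up input
    b.expandRange();
    return b;
}
```

Compiled correctly, expanded_example() should return
{-2.5, 12.5, 2.5, 17.5}. Whether this sketch triggers the same
miscompile as the attachment has not been checked.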

Steps to Reproduce:
1. Compile the attached repro_minimal.ii with:

   g++ -O3 -march=native -o repro_bug repro_minimal.ii

2. Run the binary:

   ./repro_bug

Expected Result:
-2.5000 12.5000 2.5000 17.5000
(and exit status 0)

Actual Result:
17.5000 61.2500 92.5000 66.2500
(and exit status 1)

Workaround:
Adding -fno-tree-vectorize produces the correct result:

   g++ -O3 -march=native -fno-tree-vectorize -o repro_ok repro_minimal.ii
   ./repro_ok
   # => -2.5000 12.5000 2.5000 17.5000

Regression Test Results (via Compiler Explorer, -O3 -march=native):

   x86-64 gcc 13.1  -> OK   (-2.5000 12.5000 2.5000 17.5000)
   x86-64 gcc 13.2  -> OK
   x86-64 gcc 13.3  -> OK
   x86-64 gcc 13.4  -> OK
   x86-64 gcc 14.1  -> OK
   x86-64 gcc 14.2  -> OK
   x86-64 gcc 14.3  -> OK
   x86-64 gcc 15.1  -> BUG  (17.5000 61.2500 92.5000 66.2500)
   x86-64 gcc 15.2  -> BUG  (17.5000 61.2500 92.5000 66.2500)
   x86-64 gcc 16.1  -> OK   (-2.5000 12.5000 2.5000 17.5000)

This bug was introduced in GCC 15.1 and is still present in 15.2.
It is already fixed on the GCC 16 branch (16.1), but the fix needs
to be backported to the active GCC 15 release branch.

Assembly Analysis (expandRange):
With -O3 -march=native, GCC emits the following (buggy) sequence
for the body of Box::expandRange():

   vmovsd       (%rdi), %xmm4          # uMin
   vmovsd       8(%rdi), %xmm0         # uMax
   vmovsd       24(%rdi), %xmm1        # vMax
   vmovsd       16(%rdi), %xmm5        # vMin
   vsubsd       %xmm4, %xmm0, %xmm2    # uRange = uMax - uMin
   vaddsd       %xmm4, %xmm0, %xmm0    # uMax + uMin
   vmovsd       .LC0(%rip), %xmm4      # 1.5
   vsubsd       %xmm5, %xmm1, %xmm3    # vRange = vMax - vMin
   vaddsd       %xmm5, %xmm1, %xmm1    # vMax + vMin
   vmulsd       %xmm4, %xmm2, %xmm2    # uRange *= 1.5
   vmulsd       %xmm4, %xmm3, %xmm3    # vRange *= 1.5
   vunpcklpd    %xmm2, %xmm0, %xmm0    # [uMax+uMin, uRange*1.5]
   vunpcklpd    %xmm3, %xmm1, %xmm1    # [vMax+vMin, vRange*1.5]
   vinsertf64x2 $0x1, %xmm1, %ymm0, %ymm0
   vmulpd       .LC2(%rip){1to4}, %ymm0, %ymm0
                                       # ymm0 = [uCenter, uRange/2*1.5,
                                       #         vCenter, vRange/2*1.5]
   vpermilpd    $5, %ymm0, %ymm1       # ymm1 = [uRange/2*1.5, uCenter,
                                       #         vRange/2*1.5, vCenter]
   vfmaddsub132pd %ymm0, %ymm1, %ymm0  # BUG: uses ymm0 as multiplier
                                       # instead of a vector of 1.0s.
                                       #
                                       # vfmaddsub132pd computes:
                                       #   dst[i] = dst[i]*src3[i] -/+ src2[i]
                                       # (even lanes subtract, odd lanes add).
                                       # Here src3 == dst == ymm0, so:
                                       #   ymm0[0] = uCenter*uCenter - half
                                       #   ymm0[1] = half*half + uCenter
                                       # instead of the intended:
                                       #   ymm0[0] = uCenter - half
                                       #   ymm0[1] = half + uCenter
   vmovupd      %ymm0, (%rdi)

The intended semantics for combining center +/- half_range with
vfmaddsub132pd requires a multiplier of 1.0 (so the instruction
reduces to center +/- half_range). GCC has lost the 1.0 multiplier
and reused the packed [center, half_range] vector as the multiplier,
so each lane is multiplied by itself (center*center in even lanes,
half*half in odd lanes), producing quadratic terms instead of the
linear ones expected.
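
The analysis can be cross-checked against the reported numbers with
a scalar model of the instruction. The sketch below is not GCC code;
the helper names are made up for illustration, and the center/half
values (5, 7.5, 10, 7.5) are derived from the expected output above
rather than read from the attachment. A real FMA rounds once, but
every value here is exactly representable, so the plain multiply-add
gives the same result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<double, 4>;

// Scalar model of the 132 form on four doubles:
//   out[i] = dst[i] * src3[i] - src2[i]   for even i
//   out[i] = dst[i] * src3[i] + src2[i]   for odd i
Vec4 fmaddsub132(const Vec4& dst, const Vec4& src2, const Vec4& src3) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        const double product = dst[i] * src3[i];
        out[i] = (i % 2 == 0) ? product - src2[i] : product + src2[i];
    }
    return out;
}

// Model of "vpermilpd $5": swap the two doubles in each 128-bit half.
Vec4 swap_pairs(const Vec4& v) {
    return Vec4{v[1], v[0], v[3], v[2]};
}

// [uCenter, uHalf, vCenter, vHalf], derived from the expected output.
const Vec4 kCenterHalf = {5.0, 7.5, 10.0, 7.5};

// What the emitted instruction computes: ymm0 is its own multiplier.
Vec4 emitted_result() {
    return fmaddsub132(kCenterHalf, swap_pairs(kCenterHalf), kCenterHalf);
}

// What was intended: the same operation with a multiplier of 1.0.
Vec4 intended_result() {
    const Vec4 ones = {1.0, 1.0, 1.0, 1.0};
    return fmaddsub132(kCenterHalf, swap_pairs(kCenterHalf), ones);
}

// Compare the model against the two result lines quoted in the report.
void check_model_against_report() {
    assert((emitted_result() == Vec4{17.5, 61.25, 92.5, 66.25}));
    assert((intended_result() == Vec4{-2.5, 12.5, 2.5, 17.5}));
}
```

Under these assumptions the model reproduces both the actual result
(17.5000 61.2500 92.5000 66.2500) and the expected result
(-2.5000 12.5000 2.5000 17.5000), which supports the reading that
only the multiplier operand is wrong.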

Environment:
GCC version: 15.2.0 (Ubuntu 15.2.0-16ubuntu1)
Target:      x86_64-linux-gnu
CPU:         Supports AVX2, FMA (detected by -march=native as znver4)
Configured with:
../src/configure -v --with-pkgversion='Ubuntu 15.2.0-16ubuntu1'
--with-bugurl=file:///usr/share/doc/gcc-15/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2,rust,cobol,algol68
--prefix=/usr --with-gcc-major-version-only --program-suffix=-15
--program-prefix=x86_64-linux-gnu- --enable-shared
--enable-linker-build-id --libexecdir=/usr/libexec
--without-included-gettext --enable-threads=posix --libdir=/usr/lib
--enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-libstdcxx-backtrace
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet
--with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-15-j35TAX/gcc-15-15.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-15-j35TAX/gcc-15-15.2.0/debian/tmp-gcn/usr
--enable-offload-defaulted --without-cuda-driver
--enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2

Command line that triggers the bug:
g++ -O3 -march=native -o repro_bug repro_minimal.ii

Compiler output: no warnings or errors; binary runs but produces wrong output.

Known To Work: 13.1, 13.2, 13.3, 13.4, 14.1, 14.2, 14.3, 16.1
Known To Fail: 15.1, 15.2
