https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125654
Bug ID: 125654
Summary: [15 regression] Wrong code with AVX2/FMA
vectorization: vfmaddsub132pd used with incorrect
multiplier
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: liuxin24 at iscas dot ac.cn
Target Milestone: ---
Created attachment 64665
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=64665&action=edit
a minimal case
GCC 15.2.0 miscompiles a small C++ test case when compiled with
-O3 -march=native on an x86_64 target with AVX2/FMA support. The
vectorizer combines the symmetric "center +/- half_range" computations
for two axes into a single vfmaddsub132pd instruction, but passes the
packed [center, half_range] register as both multiplicands instead of
multiplying by 1.0. Every lane is therefore squared before the
add/subtract, producing completely wrong numerical results.
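For readers without the attachment, the scalar pattern described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the attached repro_minimal.ii: the field offsets and the 1.5 factor are taken from the assembly analysis in this report, while the struct definition, the helper names `expanded_example` and `print_box`, and the input values {0, 10, 5, 15} are assumptions (the inputs were chosen because they reproduce the expected output).

```cpp
#include <cstdio>

// Hypothetical reconstruction of the miscompiled pattern; names and
// layout are inferred from the report's assembly, not from the attachment.
struct Box {
    double uMin, uMax, vMin, vMax;  // offsets 0, 8, 16, 24

    // Grow each axis to 1.5x its range around its center.
    void expandRange() {
        const double uCenter = (uMax + uMin) * 0.5;
        const double uHalf   = (uMax - uMin) * 1.5 * 0.5;
        const double vCenter = (vMax + vMin) * 0.5;
        const double vHalf   = (vMax - vMin) * 1.5 * 0.5;
        uMin = uCenter - uHalf;
        uMax = uCenter + uHalf;
        vMin = vCenter - vHalf;
        vMax = vCenter + vHalf;
    }
};

// Assumed inputs; with correct code they yield the report's expected values.
inline Box expanded_example() {
    Box b{0.0, 10.0, 5.0, 15.0};
    b.expandRange();
    return b;
}

// Print in the same four-decimal format as the report's output lines.
inline void print_box(const Box& b) {
    std::printf("%.4f %.4f %.4f %.4f\n", b.uMin, b.uMax, b.vMin, b.vMax);
}
```

When compiled correctly, `print_box(expanded_example())` should print `-2.5000 12.5000 2.5000 17.5000`. Whether this sketch triggers the same miscompilation as the attachment has not been checked.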
Steps to Reproduce:
1. Compile the attached repro_minimal.ii with:
g++ -O3 -march=native -o repro_bug repro_minimal.ii
2. Run the binary:
./repro_bug
Expected Result:
-2.5000 12.5000 2.5000 17.5000
(and exit status 0)
Actual Result:
17.5000 61.2500 92.5000 66.2500
(and exit status 1)
Workaround:
Adding -fno-tree-vectorize produces the correct result:
g++ -O3 -march=native -fno-tree-vectorize -o repro_ok repro_minimal.ii
./repro_ok
# => -2.5000 12.5000 2.5000 17.5000
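If disabling vectorization for the whole translation unit is too costly, the same option can in principle be scoped to the affected function with GCC's `optimize` function attribute. The sketch below has not been tested against this bug, so treating it as a workaround is an assumption. The `Box` definition is a hypothetical stand-in for the class in the attachment, and `NO_TREE_VECTORIZE` is an illustrative macro name. Note that the GCC manual recommends this attribute for debugging rather than production code.

```cpp
// GCC-only: scope -fno-tree-vectorize to a single function.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_TREE_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
#define NO_TREE_VECTORIZE  // other compilers: no-op
#endif

// Hypothetical stand-in for the class in the attachment.
struct Box {
    double uMin, uMax, vMin, vMax;
    NO_TREE_VECTORIZE void expandRange();
};

// Same center +/- half_range computation, compiled without the vectorizer
// when the attribute is honoured.
void Box::expandRange() {
    const double uCenter = (uMax + uMin) * 0.5;
    const double uHalf   = (uMax - uMin) * 1.5 * 0.5;
    const double vCenter = (vMax + vMin) * 0.5;
    const double vHalf   = (vMax - vMin) * 1.5 * 0.5;
    uMin = uCenter - uHalf;
    uMax = uCenter + uHalf;
    vMin = vCenter - vHalf;
    vMax = vCenter + vHalf;
}
```

If the function is inlined into a caller that is still vectorized, the attribute may not help. Inspecting the generated assembly for vfmaddsub132pd is the reliable check.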
Regression Test Results (via Compiler Explorer, -O3 -march=native):
x86-64 gcc 13.1 -> OK (-2.5000 12.5000 2.5000 17.5000)
x86-64 gcc 13.2 -> OK
x86-64 gcc 13.3 -> OK
x86-64 gcc 13.4 -> OK
x86-64 gcc 14.1 -> OK
x86-64 gcc 14.2 -> OK
x86-64 gcc 14.3 -> OK
x86-64 gcc 15.1 -> BUG (17.5000 61.2500 92.5000 66.2500)
x86-64 gcc 15.2 -> BUG (17.5000 61.2500 92.5000 66.2500)
x86-64 gcc 16.1 -> OK (-2.5000 12.5000 2.5000 17.5000)
This bug was introduced in GCC 15.1 and is still present in 15.2.
It is already fixed on the GCC 16 branch (16.1), but the fix needs
to be backported to the active GCC 15 release branch.
Assembly Analysis (expandRange):
With -O3 -march=native, GCC emits the following (buggy) sequence
for the body of Box::expandRange():
vmovsd (%rdi), %xmm4 # uMin
vmovsd 8(%rdi), %xmm0 # uMax
vmovsd 24(%rdi), %xmm1 # vMax
vmovsd 16(%rdi), %xmm5 # vMin
vsubsd %xmm4, %xmm0, %xmm2 # uRange = uMax - uMin
vaddsd %xmm4, %xmm0, %xmm0 # uMax + uMin
vmovsd .LC0(%rip), %xmm4 # 1.5
vsubsd %xmm5, %xmm1, %xmm3 # vRange = vMax - vMin
vaddsd %xmm5, %xmm1, %xmm1 # vMax + vMin
vmulsd %xmm4, %xmm2, %xmm2 # uRange *= 1.5
vmulsd %xmm4, %xmm3, %xmm3 # vRange *= 1.5
vunpcklpd %xmm2, %xmm0, %xmm0 # [uMax+uMin, uRange*1.5]
vunpcklpd %xmm3, %xmm1, %xmm1 # [vMax+vMin, vRange*1.5]
vinsertf64x2 $0x1, %xmm1, %ymm0, %ymm0
vmulpd .LC2(%rip){1to4}, %ymm0, %ymm0
# ymm0 = [uCenter, uRange/2*1.5,
# vCenter, vRange/2*1.5]
vpermilpd $5, %ymm0, %ymm1 # ymm1 = [uRange/2*1.5, uCenter,
# vRange/2*1.5, vCenter]
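vfmaddsub132pd %ymm0, %ymm1, %ymm0 # BUG: uses ymm0 as multiplier
# instead of a vector of 1.0s.
#
# vfmaddsub132pd computes
# (AT&T operand order: src3, src2, dst):
# dst[i] = dst[i]*src3[i] - src2[i] (even i)
# dst[i] = dst[i]*src3[i] + src2[i] (odd i)
# Here src3 == dst == ymm0, so each lane is squared:
# ymm0[0] = uCenter*uCenter - half
# ymm0[1] = half*half + uCenter
# (17.5 and 61.25 for uCenter = 5, half = 7.5,
# matching the actual output)
# instead of the intended:
# ymm0[0] = uCenter - half
# ymm0[1] = half + uCenter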
vmovupd %ymm0, (%rdi)
Combining center +/- half_range with vfmaddsub132pd requires a
multiplier of 1.0 in every lane, so that the instruction reduces to
center - half_range in even lanes and half_range + center in odd
lanes. GCC has dropped the 1.0 vector and reused ymm0 itself as the
multiplier, so each lane is squared and the result contains quadratic
terms instead of the expected linear ones.
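The lane arithmetic above can be checked with a small scalar model. The following is a sketch, not GCC's or Intel's reference code: `fmaddsub4` is a hypothetical helper that mimics the per-lane behaviour of a 4-lane vfmaddsub (even lanes subtract, odd lanes add), and the center/half values 5, 7.5, 10, 7.5 are inferred from the expected output in this report rather than read from the attachment.

```cpp
#include <array>
#include <cstddef>

using V4 = std::array<double, 4>;

// Scalar model of a 4-lane fmaddsub: r[i] = a[i]*b[i] -/+ c[i],
// subtracting in even lanes and adding in odd lanes.
// It uses a separate multiply and add; every value in this example is
// exactly representable, so the missing fused rounding does not matter.
inline V4 fmaddsub4(const V4& a, const V4& b, const V4& c) {
    V4 r{};
    for (std::size_t i = 0; i < 4; ++i) {
        const double prod = a[i] * b[i];
        r[i] = (i % 2 == 0) ? prod - c[i] : prod + c[i];
    }
    return r;
}

// Register contents just before the vfmaddsub132pd
// (assumed values, inferred from the expected output).
const V4 ymm0 = {5.0, 7.5, 10.0, 7.5};  // [uCenter, uHalf, vCenter, vHalf]
const V4 ymm1 = {7.5, 5.0, 7.5, 10.0};  // lanes swapped by vpermilpd $5
const V4 ones = {1.0, 1.0, 1.0, 1.0};

// Intended: multiply by 1.0, giving center -/+ half.
inline V4 intended() { return fmaddsub4(ymm0, ones, ymm1); }

// Emitted: ymm0 is both multiplicands, so each lane is squared.
inline V4 emitted() { return fmaddsub4(ymm0, ymm0, ymm1); }
```

In this model `intended()` yields -2.5, 12.5, 2.5, 17.5 and `emitted()` yields 17.5, 61.25, 92.5, 66.25, which agree with the expected and actual results reported above.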
Environment:
GCC version: 15.2.0 (Ubuntu 15.2.0-16ubuntu1)
Target: x86_64-linux-gnu
CPU: Supports AVX2, FMA (detected by -march=native as znver4)
Configured with:
../src/configure -v --with-pkgversion='Ubuntu 15.2.0-16ubuntu1'
--with-bugurl=file:///usr/share/doc/gcc-15/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2,rust,cobol,algol68
--prefix=/usr --with-gcc-major-version-only --program-suffix=-15
--program-prefix=x86_64-linux-gnu- --enable-shared
--enable-linker-build-id --libexecdir=/usr/libexec
--without-included-gettext --enable-threads=posix --libdir=/usr/lib
--enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-libstdcxx-backtrace
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet
--with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-15-j35TAX/gcc-15-15.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-15-j35TAX/gcc-15-15.2.0/debian/tmp-gcn/usr
--enable-offload-defaulted --without-cuda-driver
--enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Command line that triggers the bug:
g++ -O3 -march=native -o repro_bug repro_minimal.ii
Compiler output: no warnings or errors; binary runs but produces wrong output.
Known To Work: 13.1, 13.2, 13.3, 13.4, 14.1, 14.2, 14.3, 16.1
Known To Fail: 15.1, 15.2