https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hjl.tools at gmail dot com
--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with -Ofast and -mprefer-vector-width=256 I get
<bb 2> [local count: 63136019]:
vect__4.2_3 = MEM[(float *)a_11(D)];
vect__5.3_4 = RSQRT (vect__4.2_3);
MEM[(float *)r_12(D)] = vect__5.3_4;
vect__4.2_21 = MEM[(float *)a_11(D) + 32B];
vect__5.3_20 = RSQRT (vect__4.2_21);
MEM[(float *)r_12(D) + 32B] = vect__5.3_20;
while with -mprefer-vector-width=512 I need -mavx512er to trigger the
expander, then I also get
<bb 2> [local count: 63136020]:
vect__4.2_21 = MEM[(float *)a_11(D)];
vect__5.3_20 = RSQRT (vect__4.2_21);
MEM[(float *)r_12(D)] = vect__5.3_20;
and in that case
rsqrt:
.LFB12:
.cfi_startproc
vrsqrt28ps (%rsi), %zmm0
vmovups %zmm0, (%rdi)
vzeroupper
ret
(huh? isn't there a Newton-Raphson (NR) step missing?)
For -mprefer-vector-width=256 I get (irrespective of -mavx512er):
rsqrt:
.LFB12:
.cfi_startproc
vmovups (%rsi), %ymm1
vmovaps .LC1(%rip), %ymm3
vrsqrtps %ymm1, %ymm2
vmovaps .LC0(%rip), %ymm4
vmovups 32(%rsi), %ymm0
vmulps %ymm1, %ymm2, %ymm1
vmulps %ymm2, %ymm1, %ymm1
vmulps %ymm3, %ymm2, %ymm2
vaddps %ymm4, %ymm1, %ymm1
vmulps %ymm2, %ymm1, %ymm1
vmovups %ymm1, (%rdi)
vrsqrtps %ymm0, %ymm1
vmulps %ymm0, %ymm1, %ymm0
vmulps %ymm1, %ymm0, %ymm0
vmulps %ymm3, %ymm1, %ymm1
vaddps %ymm4, %ymm0, %ymm0
vmulps %ymm1, %ymm0, %ymm0
vmovups %ymm0, 32(%rdi)
vzeroupper
so the issue lies somewhere in the backend.
Of the "fast" flags you need -ffinite-math-only -fno-math-errno
-funsafe-math-optimizations.
GCC definitely fails to see the FMA opportunity in
ix86_emit_swsqrtsf. The a == 0 check is there because of the missing
expander w/o avx512er, where we could still use the NR sequence
with the other instruction. HJ?
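
For reference, the refinement that follows each vrsqrtps in the 256-bit
listing can be written out in scalar C. This is only a sketch of that
instruction sequence, not the code in ix86_emit_swsqrtsf. The function name
rsqrt_nr_step is made up for illustration, and the constants -3.0 and -0.5
are assumptions about what .LC0 and .LC1 hold, since the dump does not show
them.

```c
#include <assert.h>

/* One Newton-Raphson refinement of an approximate reciprocal square root:
 *   x1 = -0.5 * x0 * (a * x0 * x0 - 3.0)
 * x0 is the hardware estimate (what vrsqrtps would return for a).
 * Hypothetical helper; the constants are assumed values for .LC0/.LC1. */
static float rsqrt_nr_step(float a, float x0)
{
    /* The sketch only covers positive inputs and estimates. */
    assert(a > 0.0f && x0 > 0.0f);

    float e = a * x0;      /* vmulps: a * x0                        */
    e = e * x0;            /* vmulps: a * x0 * x0                   */
    float h = x0 * -0.5f;  /* vmulps with .LC1 (assumed to be -0.5) */
    e = e + -3.0f;         /* vaddps with .LC0 (assumed to be -3.0) */
    return e * h;          /* vmulps: the refined estimate          */
}
```

Written this way, the second multiply and the following add form a
multiply-add pair, which is presumably the FMA opportunity mentioned above.
The 512-bit listing has no such sequence after vrsqrt28ps.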