[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #26 from Hongtao.liu  ---

> I guess that's possible but the SLP vectorizer has a permute optimization
> phase (and SLP discovery itself), it would be nice to see why the former
> doesn't elide the permutes here.

I've opened PR107891 for it.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #25 from rguenther at suse dot de  ---
On Mon, 28 Nov 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
> 
> --- Comment #24 from Hongtao.liu  ---
>   _233 = {f_im_36, f_re_35, f_re_35, f_re_35};
>   _217 = {f_re_35, f_im_36, f_im_36, f_im_36};
> ...
> vect_x_re_55.15_227 = VEC_PERM_EXPR <vect_x_im_61.14_228, vect_x_im_61.13_230, { 0, 5, 6, 7 }>;
>   vect_x_re_55.23_211 = VEC_PERM_EXPR <vect_x_im_61.13_230, vect_x_im_61.14_228, { 0, 5, 6, 7 }>;
> ...
>   vect_y_re_69.17_224 = .FNMA (vect_x_re_55.15_227, _233, vect_y_re_63.9_237);
>   vect_y_re_69.25_208 = .FNMA (vect_x_re_55.23_211, _217, 
> vect_y_re_69.17_224);
> 
> is equal to
> 
>   _233 = {f_im_36,f_im_36, f_im_36, f_im_36}
>   _217 = {f_re_35, f_re_35, f_re_35, f_re_35};
> ...
>   vect_y_re_69.17_224 = .FNMA (vect_x_im_61.14_228, _233, vect_y_re_63.9_237)
>   vect_y_re_69.25_208 = .FNMA (vect_x_im_61.13_230, _217, vect_y_re_69.17_224)
> 
> A simplification in match.pd?

I guess that's possible, but the SLP vectorizer has a permute optimization
phase (and SLP discovery itself); it would be nice to see why the former
doesn't elide the permutes here.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #24 from Hongtao.liu  ---
  _233 = {f_im_36, f_re_35, f_re_35, f_re_35};
  _217 = {f_re_35, f_im_36, f_im_36, f_im_36};
...
vect_x_re_55.15_227 = VEC_PERM_EXPR <vect_x_im_61.14_228, vect_x_im_61.13_230, { 0, 5, 6, 7 }>;
  vect_x_re_55.23_211 = VEC_PERM_EXPR <vect_x_im_61.13_230, vect_x_im_61.14_228, { 0, 5, 6, 7 }>;
...
  vect_y_re_69.17_224 = .FNMA (vect_x_re_55.15_227, _233, vect_y_re_63.9_237);
  vect_y_re_69.25_208 = .FNMA (vect_x_re_55.23_211, _217, vect_y_re_69.17_224);

is equal to

  _233 = {f_im_36,f_im_36, f_im_36, f_im_36}
  _217 = {f_re_35, f_re_35, f_re_35, f_re_35};
...
  vect_y_re_69.17_224 = .FNMA (vect_x_im_61.14_228, _233, vect_y_re_63.9_237)
  vect_y_re_69.25_208 = .FNMA (vect_x_im_61.13_230, _217, vect_y_re_69.17_224)

A simplification in match.pd?
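
For illustration, here is a hypothetical scalar sketch (not from the PR; x13/x14
stand for the two x loads and y for the accumulator) of why the two forms agree
lane by lane: swapping lane 0 between the two x vectors while also swapping the
f_re/f_im multipliers in that lane only changes the order of the two
subtractions, which -Ofast's reassociation already permits.

#include <assert.h>
#include <math.h>

static void check (const double x13[4], const double x14[4],
                   const double y[4], double f_re, double f_im)
{
  for (int i = 0; i < 4; i++)
    {
      /* permuted form: lane 0 is taken from the other x vector and the
         multiplier constants are swapped in that lane, as in _233/_217.  */
      double a  = i == 0 ? x14[i] : x13[i];
      double ma = i == 0 ? f_im   : f_re;
      double b  = i == 0 ? x13[i] : x14[i];
      double mb = i == 0 ? f_re   : f_im;
      double permuted = y[i] - a * ma - b * mb;

      /* unpermuted form with uniform multipliers */
      double plain = y[i] - x14[i] * f_im - x13[i] * f_re;

      /* equal up to reassociation of the two subtractions */
      assert (fabs (permuted - plain) < 1e-12);
    }
}

int main (void)
{
  double x13[4] = { 1, 2, 3, 4 }, x14[4] = { 5, 6, 7, 8 };
  double y[4] = { 9, 10, 11, 12 };
  check (x13, x14, y, 0.25, -0.75);
  return 0;
}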

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #23 from Hongtao.liu  ---

> the blends do not look like no-ops so I wonder if this is really computing
> the same thing ... (it swaps lane 0 from the two loads from x but not the
> stores)

They're computing the same thing, since we also do the same "permutation" for
the invariants f_re and f_im.  Can we eliminate that in the vectorizer?

  _232 = {f_im_36, f_im_36, f_im_36, f_im_36};
  _231 = {f_im_36, f_re_35, f_re_35, f_re_35}; --- here
  _216 = {f_re_35, f_re_35, f_re_35, f_re_35};
  _215 = {f_re_35, f_im_36, f_im_36, f_im_36}; -- and here.
  ivtmp.36_221 = (unsigned long) y_41(D);
  ivtmp.38_61 = (unsigned long) x_33(D);

   [local count: 214748368]:
  # ivtmp.32_66 = PHI 
  # ivtmp.36_64 = PHI 
  # ivtmp.38_220 = PHI 
  # DEBUG c => NULL
  # DEBUG k => 0
  # DEBUG BEGIN_STMT
  # DEBUG BEGIN_STMT
  # DEBUG D#78 => D#79 * 8
  # DEBUG D#77 => x_33(D) + D#78
  _62 = (void *) ivtmp.38_220;
  vect_x_im_61.13_228 = MEM <vector(4) double> [(const double *)_62];
  vect_x_im_61.14_226 = MEM <vector(4) double> [(const double *)_62 + 32B];
  vect_x_re_55.15_225 = VEC_PERM_EXPR <vect_x_im_61.14_226, vect_x_im_61.13_228, { 0, 5, 6, 7 }>;
  vect_x_re_55.23_209 = VEC_PERM_EXPR <vect_x_im_61.13_228, vect_x_im_61.14_226, { 0, 5, 6, 7 }>;
  # DEBUG D#76 => *D#77
  # DEBUG x_re => D#76
  # DEBUG BEGIN_STMT
  # DEBUG D#74 => (long unsigned int) D#75
  # DEBUG D#73 => D#74 * 8
  # DEBUG D#72 => x_33(D) + D#73
  # DEBUG D#71 => *D#72
  # DEBUG x_im => D#71
  # DEBUG BEGIN_STMT
  # DEBUG D#70 => y_41(D) + D#78
  _59 = (void *) ivtmp.36_64;
  vect_y_re_63.9_235 = MEM <vector(4) double> [(double *)_59];
  vect_y_re_63.10_233 = MEM <vector(4) double> [(double *)_59 + 32B];
  vect__42.18_219 = .FMA (vect_x_im_61.13_228, _232, vect_y_re_63.10_233);
  vect_y_re_69.17_222 = .FNMA (vect_x_re_55.15_225, _231, vect_y_re_63.9_235);
  vect_y_re_69.25_206 = .FNMA (vect_x_re_55.23_209, _215, vect_y_re_69.17_222);
  vect_y_re_69.25_205 = .FNMA (_216, vect_x_im_61.14_226, vect__42.18_219);

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #22 from Michael_S  ---
(In reply to Alexander Monakov from comment #21)
> (In reply to Michael_S from comment #19)
> > > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > > IVs for the two arrays actually helps on this testcase.
> > 
> > Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> > %ymm3, %ymm0' would be turned into 2 uops.
> 
> The difference is at which point in the pipeline. The latter goes through
> renaming as one fused uop.
> 

Intel never documents such fine details in their Optimization Reference
manuals.
But I believe you.

> > Misuse of load+op is far bigger problem in this particular test case than
> > sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns
> > loop that can potentially run at 3 clocks per iteration into loop of 4+
> > clocks per iteration.
> 
> Sorry, which assembler output this refers to?
>

gcc12 -O3 -mavx2 -mfma

gcc12 -O3 -march=skylake does not suffer from this problem.

I still think that RISC-style icc code will be a little faster on Skylake, but
here we are arguing about 1/4th of a cycle per iteration rather than a full
cycle.
https://godbolt.org/z/nfa7c9se3

> > But I consider it a separate issue. I reported similar issue in 97127, but
> > here it is more serious. It looks to me that the issue is not soluble within
> > existing gcc optimization framework. The only chance is if you accept my old
> > and simple advice - within inner loops pretend that AVX is RISC, i.e.
> > generate code as if load-op form of AVX instructions weren't existing.
> 
> In bug 97127 the best explanation we have so far is we don't optimally
> handle the case where non-memory inputs of an fma are reused, so we can't
> combine a load with an fma without causing an extra register copy (PR 97127
> comment 16 demonstrates what I mean). I cannot imagine such trouble arising
> with more common commutative operations like mul/add, especially with
> non-destructive VEX encoding. If you hit such examples, I would suggest to
> report them also, because their root cause might be different.
> 
> In general load-op combining should be very helpful on x86, because it
> reduces the number of uops flowing through the renaming stage, which is one
> of the narrowest points in the pipeline.

If compilers were perfect, AVX load-op combining would be somewhat helpful; I
have my doubts about "very helpful".  But compilers are not perfect.
For the non-AVX case, where every op is destructive and repeated loads are on
average cheaper than with AVX, combining load and op is far more profitable.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #21 from Alexander Monakov  ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > IVs for the two arrays actually helps on this testcase.
> 
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline. The latter goes through
renaming as one fused uop.

> Misuse of load+op is far bigger problem in this particular test case than
> sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns
> loop that can potentially run at 3 clocks per iteration into loop of 4+
> clocks per iteration.

Sorry, which assembler output this refers to?

> But I consider it a separate issue. I reported similar issue in 97127, but
> here it is more serious. It looks to me that the issue is not soluble within
> existing gcc optimization framework. The only chance is if you accept my old
> and simple advice - within inner loops pretend that AVX is RISC, i.e.
> generate code as if load-op form of AVX instructions weren't existing.

In bug 97127 the best explanation we have so far is we don't optimally handle
the case where non-memory inputs of an fma are reused, so we can't combine a
load with an fma without causing an extra register copy (PR 97127 comment 16
demonstrates what I mean). I cannot imagine such trouble arising with more
common commutative operations like mul/add, especially with non-destructive VEX
encoding. If you hit such examples, I would suggest to report them also,
because their root cause might be different.

In general load-op combining should be very helpful on x86, because it reduces
the number of uops flowing through the renaming stage, which is one of the
narrowest points in the pipeline.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #20 from Michael_S  ---
(In reply to Richard Biener from comment #17)
> (In reply to Michael_S from comment #16)
> > On unrelated note, why loop overhead uses so many instructions?
> > Assuming that I am as misguided as gcc about load-op combining, I would
> > write it as:
> >   sub %rax, %rdx
> > .L3:
> >   vmovupd   (%rdx,%rax), %ymm1
> >   vmovupd 32(%rdx,%rax), %ymm0
> >   vfmadd213pd    32(%rax), %ymm3, %ymm1
> >   vfnmadd213pd (%rax), %ymm2, %ymm0
> >   vfnmadd231pd   32(%rdx,%rax), %ymm3, %ymm0
> >   vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
> >   vmovupd %ymm0,   (%rax)
> >   vmovupd %ymm1, 32(%rax)
> >   addq    $64, %rax
> >   decl    %esi
> >   jb  .L3
> >   
> > The loop overhead in my variant is 3 x86 instructions==2 macro-ops,
> > vs 5 x86 instructions==4 macro-ops in gcc variant.
> > Also, in gcc variant all memory accesses have displacement that makes them
> > 1 byte longer. In my variant only half of accesses have displacement.
> > 
> > I think, in the past I had seen cases where gcc generates optimal or
> > near-optimal
> > code sequences for loop overhead. I wonder why it can not do it here.
> 
> I don't think we currently consider IVs based on the difference of two
> addresses.  

It seems to me that I had seen you doing it.
But maybe I confuse gcc with clang.

> The cost benefit of no displacement is only size, 

Size is pretty important in high-IPC SIMD loops, especially on Intel and when
the number of iterations is small, because Intel has 16-byte instruction fetch
out of the L1I cache. SIMD instructions tend to be long, and not many
instructions fit within 16 bytes even when memory accesses have no
displacement; a displacement adds insult to injury.

> otherwise
> I have no idea why we have biased the %rax accesses by -32.  Why we
> fail to consider decrement-to-zero for the counter IV is probably because
> IVCANON would add such IV but the vectorizer replaces that and IVOPTs
> doesn't consider re-adding that.

Sorry, I have no idea about the meaning of IVCANON.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #19 from Michael_S  ---
(In reply to Alexander Monakov from comment #18)
> The apparent 'bias' is introduced by instruction scheduling: haifa-sched
> lifts a +64 increment over memory accesses, transforming +0 and +32
> displacements to -64 and -32. Sometimes this helps a little bit even on
> modern x86 CPUs.

I don't think that it ever helps on Intel Sandy Bridge or later or on AMD Zen1
or later.

> 
> Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> IVs for the two arrays actually helps on this testcase.

Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
%ymm3, %ymm0' would be turned into 2 uops.

Misuse of load+op is a far bigger problem in this particular test case than
sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns a loop
that can potentially run at 3 clocks per iteration into a loop of 4+ clocks per
iteration.
But I consider it a separate issue. I reported a similar issue in 97127, but
here it is more serious. It looks to me that the issue is not soluble within
the existing gcc optimization framework. The only chance is if you accept my
old and simple advice - within inner loops, pretend that AVX is RISC, i.e.
generate code as if the load-op forms of AVX instructions didn't exist.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #18 from Alexander Monakov  ---
The apparent 'bias' is introduced by instruction scheduling: haifa-sched lifts
a +64 increment over memory accesses, transforming +0 and +32 displacements to
-64 and -32. Sometimes this helps a little bit even on modern x86 CPUs.

Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
'unlaminated' (turned to 2 uops before renaming), so selecting independent IVs
for the two arrays actually helps on this testcase.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #17 from Richard Biener  ---
(In reply to Michael_S from comment #16)
> On unrelated note, why loop overhead uses so many instructions?
> Assuming that I am as misguided as gcc about load-op combining, I would
> write it as:
>   sub %rax, %rdx
> .L3:
>   vmovupd   (%rdx,%rax), %ymm1
>   vmovupd 32(%rdx,%rax), %ymm0
>   vfmadd213pd    32(%rax), %ymm3, %ymm1
>   vfnmadd213pd (%rax), %ymm2, %ymm0
>   vfnmadd231pd   32(%rdx,%rax), %ymm3, %ymm0
>   vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
>   vmovupd %ymm0,   (%rax)
>   vmovupd %ymm1, 32(%rax)
>   addq    $64, %rax
>   decl    %esi
>   jb  .L3
>   
> The loop overhead in my variant is 3 x86 instructions==2 macro-ops,
> vs 5 x86 instructions==4 macro-ops in gcc variant.
> Also, in gcc variant all memory accesses have displacement that makes them
> 1 byte longer. In my variant only half of accesses have displacement.
> 
> I think, in the past I had seen cases where gcc generates optimal or
> near-optimal
> code sequences for loop overhead. I wonder why it can not do it here.

I don't think we currently consider IVs based on the difference of two
addresses.  The cost benefit of no displacement is only size, otherwise
I have no idea why we have biased the %rax accesses by -32.  Why we
fail to consider decrement-to-zero for the counter IV is probably because
IVCANON would add such IV but the vectorizer replaces that and IVOPTs
doesn't consider re-adding that.
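
For reference, a hypothetical C-level sketch (not GCC output) of the
decrement-to-zero shape mentioned above: a separate counter is counted down and
the back-edge test reuses the flags set by the decrement, so no extra compare
against a bound is needed (assumes n > 0).

void axpy_like (double *y, const double *x, double f, unsigned long n)
{
  unsigned long left = n;           /* counter IV, counted down to zero */
  do
    {
      *y = *y - f * *x;
      ++x;
      ++y;
      --left;                       /* the decrement sets the flags ...  */
    }
  while (left != 0);                /* ... so this can become dec + jne, no cmp */
}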

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #16 from Michael_S  ---
On an unrelated note, why does the loop overhead use so many instructions?
Assuming that I am as misguided as gcc about load-op combining, I would write
it as:
  sub %rax, %rdx
.L3:
  vmovupd   (%rdx,%rax), %ymm1
  vmovupd 32(%rdx,%rax), %ymm0
  vfmadd213pd    32(%rax), %ymm3, %ymm1
  vfnmadd213pd (%rax), %ymm2, %ymm0
  vfnmadd231pd   32(%rdx,%rax), %ymm3, %ymm0
  vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
  vmovupd %ymm0,   (%rax)
  vmovupd %ymm1, 32(%rax)
  addq    $64, %rax
  decl    %esi
  jb  .L3

The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
vs 5 x86 instructions == 4 macro-ops in the gcc variant.
Also, in the gcc variant all memory accesses have a displacement, which makes
them 1 byte longer. In my variant only half of the accesses have a displacement.

I think in the past I have seen cases where gcc generates optimal or
near-optimal code sequences for loop overhead. I wonder why it cannot do it
here.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #15 from Richard Biener  ---
I can confirm we get

.L3:
vmovupd (%rsi), %ymm1
vmovupd 32(%rsi), %ymm0
addl    $1, %eax
addq    $64, %rdi
addq    $64, %rsi
vblendpd        $14, %ymm1, %ymm0, %ymm3
vblendpd        $14, %ymm0, %ymm1, %ymm2
vfnmadd213pd    -64(%rdi), %ymm5, %ymm3
vfmadd213pd     -32(%rdi), %ymm7, %ymm1
vfnmadd132pd    %ymm4, %ymm3, %ymm2
vfnmadd132pd    %ymm6, %ymm1, %ymm0
vmovupd %ymm2, -64(%rdi)
vmovupd %ymm0, -32(%rdi)
cmpl    %edx, %eax
jb  .L3

instead of

.L3:
vmovupd (%rdx), %ymm1
vmovupd (%rdx), %ymm0
addl    $1, %ecx
addq    $64, %rax
vfmadd213pd     -32(%rax), %ymm3, %ymm1
vfnmadd213pd    -64(%rax), %ymm2, %ymm0
addq    $64, %rdx
vfnmadd231pd    -32(%rdx), %ymm3, %ymm0
vfnmadd231pd    -32(%rdx), %ymm2, %ymm1
vmovupd %ymm0, -64(%rax)
vmovupd %ymm1, -32(%rax)
cmpl    %esi, %ecx
jb  .L3

the good case sees

   [local count: 214748368]:
  # ivtmp.27_211 = PHI 
  # ivtmp.32_209 = PHI 
  # ivtmp.34_28 = PHI 
  _53 = (void *) ivtmp.34_28;
  vect_x_re_54.13_193 = MEM <vector(4) double> [(const double *)_53];
  vect_x_im_60.21_176 = MEM <vector(4) double> [(const double *)_53 + 32B];
  _54 = (void *) ivtmp.32_209;
  vect_y_re_62.9_200 = MEM <vector(4) double> [(double *)_54];
  vect_y_re_62.10_198 = MEM <vector(4) double> [(double *)_54 + 32B];
  vect__154.17_185 = .FMA (vect_x_re_54.13_193, _197, vect_y_re_62.10_198);
  vect__66.16_188 = .FNMA (vect_x_re_54.13_193, _196, vect_y_re_62.9_200);
  vect_y_re_68.23_173 = .FNMA (vect_x_im_60.21_176, _197, vect__66.16_188);
  vect_y_re_68.23_172 = .FNMA (vect_x_im_60.21_176, _196, vect__154.17_185);
  MEM <vector(4) double> [(double *)_54] = vect_y_re_68.23_173;
  MEM <vector(4) double> [(double *)_54 + 32B] = vect_y_re_68.23_172;
  ivtmp.27_210 = ivtmp.27_211 + 1;
  ivtmp.32_208 = ivtmp.32_209 + 64;
  ivtmp.34_51 = ivtmp.34_28 + 64;
  if (bnd.6_207 > ivtmp.27_210)
goto ; [90.00%]

while the bad has

   [local count: 214748368]:
  # ivtmp.31_65 = PHI 
  # ivtmp.36_63 = PHI 
  # ivtmp.38_203 = PHI 
  _61 = (void *) ivtmp.38_203;
  vect_x_im_60.13_211 = MEM <vector(4) double> [(const double *)_61];
  vect_x_im_60.14_209 = MEM <vector(4) double> [(const double *)_61 + 32B];
  vect_x_re_54.15_208 = VEC_PERM_EXPR <vect_x_im_60.14_209, vect_x_im_60.13_211, { 0, 5, 6, 7 }>;
  vect_x_re_54.23_192 = VEC_PERM_EXPR <vect_x_im_60.13_211, vect_x_im_60.14_209, { 0, 5, 6, 7 }>;
  _58 = (void *) ivtmp.36_63;
  vect_y_re_62.9_218 = MEM <vector(4) double> [(double *)_58];
  vect_y_re_62.10_216 = MEM <vector(4) double> [(double *)_58 + 32B];
  vect__41.18_202 = .FMA (vect_x_im_60.13_211, _215, vect_y_re_62.10_216);
  vect_y_re_68.17_205 = .FNMA (vect_x_re_54.15_208, _214, vect_y_re_62.9_218);
  vect_y_re_68.25_189 = .FNMA (vect_x_re_54.23_192, _198, vect_y_re_68.17_205);
  vect_y_re_68.25_188 = .FNMA (_199, vect_x_im_60.14_209, vect__41.18_202);
  MEM <vector(4) double> [(double *)_58] = vect_y_re_68.25_189;
  MEM <vector(4) double> [(double *)_58 + 32B] = vect_y_re_68.25_188;
  ivtmp.31_64 = ivtmp.31_65 + 1;
  ivtmp.36_62 = ivtmp.36_63 + 64;
  ivtmp.38_59 = ivtmp.38_203 + 64;
  if (ivtmp.31_64 < bnd.6_225)
goto ; [90.00%]

the blends do not look like no-ops so I wonder if this is really computing
the same thing ... (it swaps lane 0 from the two loads from x but not the
stores)

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #14 from Michael_S  ---
I tested the smaller test bench from Comment 3 with gcc trunk on godbolt.
The issue appears to be only partially fixed.
The -Ofast result is no longer the horror it was before, but it is still not as
good as -O3 or -O2. -Ofast code generation is still strange and there are a few
vblendpd instructions that serve no useful purpose.

And -O2/O3 is still not as good as it should be or as good as icc.
But, as mentioned in my original post, over-aggressive load+op combining is a
separate problem.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-01-20 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Andrew Pinski  changed:

   What|Removed |Added

   Target Milestone|--- |12.0

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2021-06-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
  Known to work||12.0

--- Comment #13 from Richard Biener  ---
Fixed (hopefully), for GCC 12.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2021-06-09 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #12 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:ce670e4faafb296d1f1a7828d20f8c8ba4686797

commit r12-1329-gce670e4faafb296d1f1a7828d20f8c8ba4686797
Author: Richard Biener 
Date:   Wed Nov 18 14:17:34 2020 +0100

tree-optimization/97832 - handle associatable chains in SLP discovery

This makes SLP discovery handle associatable (including mixed
plus/minus) chains better by swapping operands across the whole
chain.  To work this adds caching of the 'matches' lanes for
failed SLP discovery attempts, thereby fixing a failed SLP
discovery for the slp-pr98855.cc testcase which results in
building an operand from scalars as expected.  Unfortunately
this makes us trip over the cost threshold so I'm XFAILing the
testcase for now.

For BB vectorization all this doesn't work because we have no way
to distinguish good from bad associations as we eventually build
operands from scalars and thus not fail in the classical sense.

2021-05-31  Richard Biener  

PR tree-optimization/97832
* tree-vectorizer.h (_slp_tree::failed): New.
* tree-vect-slp.c (_slp_tree::_slp_tree): Initialize
failed member.
(_slp_tree::~_slp_tree): Free failed.
(vect_build_slp_tree): Retain failed nodes and record
matches in them, copying that back out when running
into a cached fail.  Dump start and end of discovery.
(dt_sort_cmp): New.
(vect_build_slp_tree_2): Handle associatable chains
together doing more aggressive operand swapping.

* gcc.dg/vect/pr97832-1.c: New testcase.
* gcc.dg/vect/pr97832-2.c: Likewise.
* gcc.dg/vect/pr97832-3.c: Likewise.
* g++.dg/vect/slp-pr98855.cc: XFAIL.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-19 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #11 from Richard Biener  ---
(In reply to Michael_S from comment #10)
> I lost track of what you're talking about long time ago.
> But that's o.k.

No problem - difficult PRs tend to be used as media to brain-dump and record
work progress.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-19 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #10 from Michael_S  ---
I lost track of what you're talking about a long time ago.
But that's o.k.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #9 from Richard Biener  ---
There's then also a permute optimization left on the plate:

t.c:16:3: note:   node 0x3a19590 (max_nunits=4, refcnt=2)
t.c:16:3: note: stmt 0 _153 = f11_im_76 * x1_im_142;
t.c:16:3: note: stmt 1 _213 = f11_re_72 * x1_re_202;
t.c:16:3: note: stmt 2 _275 = f11_re_72 * x1_re_264;
t.c:16:3: note: stmt 3 _337 = f11_re_72 * x1_re_326;
t.c:16:3: note: stmt 4 _155 = f11_im_76 * x1_re_140;
t.c:16:3: note: stmt 5 _217 = f11_im_76 * x1_re_202;
t.c:16:3: note: stmt 6 _279 = f11_im_76 * x1_re_264;
t.c:16:3: note: stmt 7 _341 = f11_im_76 * x1_re_326;
t.c:16:3: note: children 0x3a19600 0x3a19670
t.c:16:3: note:   node (external) 0x3a19600 (max_nunits=1, refcnt=1)
t.c:16:3: note: { f11_im_76, f11_re_72, f11_re_72, f11_re_72,
f11_im_76, f11_im_76, f11_im_76, f11_im_76 }
t.c:16:3: note:   node 0x3a19670 (max_nunits=4, refcnt=1)
t.c:16:3: note: stmt 0 x1_im_142 = *_141;
t.c:16:3: note: stmt 1 x1_re_202 = *_201;
t.c:16:3: note: stmt 2 x1_re_264 = *_263;
t.c:16:3: note: stmt 3 x1_re_326 = *_325;
t.c:16:3: note: stmt 4 x1_re_140 = *_139;
t.c:16:3: note: stmt 5 x1_re_202 = *_201;
t.c:16:3: note: stmt 6 x1_re_264 = *_263;
t.c:16:3: note: stmt 7 x1_re_326 = *_325;
t.c:16:3: note: load permutation { 4 1 2 3 0 1 2 3 }

which we currently do not handle (there's a FIXME as to permute externals,
currently we only handle splats as transparent for permutes).
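
For illustration only (hypothetical, not from the dump): a splat { s, s, s, s }
is transparent for permutes because VEC_PERM_EXPR <{s,s,s,s}, {s,s,s,s}, sel> is
still { s, s, s, s } for any selector, whereas the external node above mixes
f11_im_76 and f11_re_72, so the load permutation { 4 1 2 3 0 1 2 3 } cannot be
pushed through it without also permuting the external vector.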

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #8 from Richard Biener  ---
Created attachment 49586
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49586&action=edit
prototype

This is a prototype patch which can serve as proof-of-concept.  It needs
cleanup plus better handling of hybrid SLP discovery.

It depends on
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559347.html to fix the
testcase in this PR (which is included in the patch).

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #7 from Richard Biener  ---
Or

double a[1024], b[1024], c[1024];

void foo()
{
  for (int i = 0; i < 256; ++i)
{
  a[2*i] = 1. - a[2*i] + b[2*i];
  a[2*i+1] = 1 + a[2*i+1] - b[2*i+1];
}
}

which early folding breaks unless we add -fno-associative-math.  We then
end up with

  a[_1] = (((b[_1]) - (a[_1])) + 1.0e+0);
  a[_6] = (((a[_6]) - (b[_6])) + 1.0e+0);

where SLP operator swapping cannot manage to bring the grouped loads into
the same lanes.

So the idea is to look at single-use chains of plus/minus operations and
handle those as wide associated SLP nodes with flags denoting which lanes
need negation.  We'd have three children and each child has a per-lane
spec of whether to add or subtract.
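
As a rough illustration only (hypothetical C, not GCC's actual SLP data
structures), such a node could record a per-child, per-lane sign:

#include <stdbool.h>

#define MAX_LANES 8
#define MAX_CHAIN_OPS 3   /* e.g. the three children mentioned above */

/* Sketch of a "wide associated" SLP chain node: one node covers the whole
   plus/minus chain, and each child carries per-lane flags saying whether
   that child is added or subtracted in a given lane.  */
struct slp_chain_node
{
  int n_children;
  int n_lanes;
  struct slp_chain_node *child[MAX_CHAIN_OPS];
  /* negate[c][l] means child c is subtracted (rather than added) in lane l */
  bool negate[MAX_CHAIN_OPS][MAX_LANES];
};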

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #6 from Richard Biener  ---
So for example we'd like to vectorize with SLP when reassociation is permitted
(thus with -Ofast for example):

double a[1024], b[1024], c[1024];

void foo()
{
  for (int i = 0; i < 256; ++i)
{
  a[2*i] = 1. - a[2*i] + b[2*i];
  a[2*i+1] = a[2*i+1] + b[2*i+1] + 1.;
}
}

it again works when written as follows and with -fno-tree-reassoc

double a[1024], b[1024], c[1024];

void foo()
{
  for (int i = 0; i < 256; ++i)
{
  a[2*i] = 1. - a[2*i] + b[2*i];
  a[2*i+1] = 1 + a[2*i+1] + b[2*i+1];
}
}

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #5 from Richard Biener  ---
OK, so I have a patch to keep the association linear, which IMHO is good.  It
fixes the smaller testcase and mine but not the original one, which is now
linear but still not homogeneous.  The store groups are as follows

  *_115 = (((((*_115) - (f00_re_68 * x0_re_108)) - (f10_re_70 * x1_re_140)) -
(f00_im_73 * x0_im_114)) - (f10_im_74 * x1_im_142));
  *_117 = (((((*_117) + (f00_im_73 * x0_re_108)) + (f10_im_74 * x1_re_140)) -
(f00_re_68 * x0_im_114)) - (f10_re_70 * x1_im_142));
  *_119 = (((((*_119) - (f01_re_71 * x0_re_108)) - (f11_re_72 * x1_re_140)) -
(f01_im_75 * x0_im_114)) - (f11_im_76 * x1_im_142));
  *_121 = (((((*_121) + (f01_im_75 * x0_re_108)) + (f11_im_76 * x1_re_140)) -
(f01_re_71 * x0_im_114)) - (f11_re_72 * x1_im_142));

(good)

  *_177 = (((((*_177) - (f00_re_68 * x0_re_170)) - (f00_im_73 * x0_im_176)) -
(f10_re_70 * x1_re_202)) - (f10_im_74 * x1_im_204));
  *_179 = (((((f00_im_73 * x0_re_170) + (f10_im_74 * x1_re_202)) + (*_179)) -
(f00_re_68 * x0_im_176)) - (f10_re_70 * x1_im_204));
  *_181 = (((((*_181) - (f01_re_71 * x0_re_170)) - (f01_im_75 * x0_im_176)) -
(f11_re_72 * x1_re_202)) - (f11_im_76 * x1_im_204));
  *_183 = (((((f01_im_75 * x0_re_170) + (f11_im_76 * x1_re_202)) + (*_183)) -
(f01_re_71 * x0_im_176)) - (f11_re_72 * x1_im_204));

already bad.  Now, this is something to tackle in the vectorizer, which ideally
should not try to match up individual adds during SLP discovery but
instead (if association is allowed) handle the whole addition chain, commutating
within the whole chain rather than just swapping individual add operands.

I still think the reassoc change I came up with is good since it avoids
the need to linearize in the vectorizer.  So testing that now.

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Richard Biener  changed:

   What|Removed |Added

  Component|target  |tree-optimization

--- Comment #4 from Richard Biener  ---
Ah, thanks - that helps.  So we're re-associating from

  *_89 = (((*_89) - (f_re_34 * x_re_82)) - (f_im_35 * x_im_88));
  *_91 = (((*_91) + (f_im_35 * x_re_82)) - (f_re_34 * x_im_88));

to

  *_89 = ((*_89) - ((f_re_34 * x_re_82) + (f_im_35 * x_im_88)));
  *_91 = (((*_91) + (f_im_35 * x_re_82)) - (f_re_34 * x_im_88));

that makes the operations unbalanced.  This is (a - b) - c -> a - (b + c)
as we're optimizing this as a + -b + -c.

Even smaller testcase:

double a[1024], b[1024], c[1024];

void foo()
{
  for (int i = 0; i < 256; ++i)
{
  a[2*i] = a[2*i] + b[2*i] - c[2*i];
  a[2*i+1] = a[2*i+1] - b[2*i+1] - c[2*i+1];
}
}

here ranks end up associating the expr as (-b + -c) + a and negate
re-propagation goes (-b - c) + a -> -(b + c) + a -> a - (b + c)
which is all sensible in isolation.

You could say that associating as (-b + -c) + a is worse than
(a + -b) + -c in this respect.  Ranks are

Rank for _8 is 327683 (a)
Rank for _13 is 327684 (-b)
Rank for _21 is 327684 (-c)

where the rank is one more for the negated values because of the
negate operation.  While heuristically ignoring negates for rank
propagation to make all ranks equal helps this new testcase it
doesn't help for the larger two.

It might still be a generally sound heuristic improvement though.

For the effects on vectorization I think we need to do something in the
vectorizer itself, for example linearizing expressions.  The
first reassoc pass is supposed to do this but then negate
re-propagation undoes it in this case - which maybe points to
that being what needs fixing, somehow associating a not-negated operand
first.
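
For illustration: had ranks associated the odd-lane store as (a + -b) + -c
instead, negate re-propagation would turn it into (a - b) - c, which has the
same shape as the (a + b) - c of the even lanes, so SLP discovery could pair
the two stores without any operand juggling.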