[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #26 from Hongtao.liu ---
> I guess that's possible, but the SLP vectorizer has a permute optimization
> phase (and SLP discovery itself); it would be nice to see why the former
> doesn't elide the permutes here.

I've opened PR107891 for it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #25 from rguenther at suse dot de ---
On Mon, 28 Nov 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
>
> --- Comment #24 from Hongtao.liu ---
>   _233 = {f_im_36, f_re_35, f_re_35, f_re_35};
>   _217 = {f_re_35, f_im_36, f_im_36, f_im_36};
> ...
>   vect_x_re_55.15_227 = VEC_PERM_EXPR <..., { 0, 5, 6, 7 }>;
>   vect_x_re_55.23_211 = VEC_PERM_EXPR <..., vect_x_im_61.14_228, { 0, 5, 6, 7 }>;
> ...
>   vect_y_re_69.17_224 = .FNMA (vect_x_re_55.15_227, _233, vect_y_re_63.9_237);
>   vect_y_re_69.25_208 = .FNMA (vect_x_re_55.23_211, _217, vect_y_re_69.17_224);
>
> is equal to
>
>   _233 = {f_im_36, f_im_36, f_im_36, f_im_36};
>   _217 = {f_re_35, f_re_35, f_re_35, f_re_35};
> ...
>   vect_y_re_69.17_224 = .FNMA (vect_x_im_61.14_228, _233, vect_y_re_63.9_237);
>   vect_y_re_69.25_208 = .FNMA (vect_x_im_61.13_230, _217, vect_y_re_69.17_224);
>
> A simplification in match.pd?

I guess that's possible, but the SLP vectorizer has a permute optimization
phase (and SLP discovery itself); it would be nice to see why the former
doesn't elide the permutes here.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #24 from Hongtao.liu ---
  _233 = {f_im_36, f_re_35, f_re_35, f_re_35};
  _217 = {f_re_35, f_im_36, f_im_36, f_im_36};
...
  vect_x_re_55.15_227 = VEC_PERM_EXPR <..., { 0, 5, 6, 7 }>;
  vect_x_re_55.23_211 = VEC_PERM_EXPR <..., vect_x_im_61.14_228, { 0, 5, 6, 7 }>;
...
  vect_y_re_69.17_224 = .FNMA (vect_x_re_55.15_227, _233, vect_y_re_63.9_237);
  vect_y_re_69.25_208 = .FNMA (vect_x_re_55.23_211, _217, vect_y_re_69.17_224);

is equal to

  _233 = {f_im_36, f_im_36, f_im_36, f_im_36};
  _217 = {f_re_35, f_re_35, f_re_35, f_re_35};
...
  vect_y_re_69.17_224 = .FNMA (vect_x_im_61.14_228, _233, vect_y_re_63.9_237);
  vect_y_re_69.25_208 = .FNMA (vect_x_im_61.13_230, _217, vect_y_re_69.17_224);

A simplification in match.pd?
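Lane by lane, the equivalence above is just reordering two fused
negate-multiply-adds. A scalar C sketch of that identity (the helper and
function names are illustrative, not from the dump):

  /* .FNMA computes -a*b + acc elementwise.  */
  static double fnma (double a, double b, double acc) { return -a * b + acc; }

  /* One lane of the permuted variant: acc - x_im*f_re - x_re*f_im.  */
  double lane_permuted (double x_re, double x_im, double f_re, double f_im,
                        double acc)
  {
    return fnma (x_re, f_im, fnma (x_im, f_re, acc));
  }

  /* One lane of the unpermuted variant: acc - x_re*f_im - x_im*f_re.  */
  double lane_plain (double x_re, double x_im, double f_re, double f_im,
                     double acc)
  {
    return fnma (x_im, f_re, fnma (x_re, f_im, acc));
  }

Both subtract the same two products from the same accumulator; only the
order of the two subtractions differs, which -Ofast's reassociation permits.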
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #23 from Hongtao.liu ---
> the blends do not look like no-ops so I wonder if this is really computing
> the same thing ... (it swaps lane 0 from the two loads from x but not the
> stores)

They're computing the same thing, since we also do the same "permutation"
for the invariants f_re and f_im. Can we eliminate that in the vectorizer?

  _232 = {f_im_36, f_im_36, f_im_36, f_im_36};
  _231 = {f_im_36, f_re_35, f_re_35, f_re_35};
--- here
  _216 = {f_re_35, f_re_35, f_re_35, f_re_35};
  _215 = {f_re_35, f_im_36, f_im_36, f_im_36};
-- and here.
  ivtmp.36_221 = (unsigned long) y_41(D);
  ivtmp.38_61 = (unsigned long) x_33(D);

  <bb ...> [local count: 214748368]:
  # ivtmp.32_66 = PHI <...>
  # ivtmp.36_64 = PHI <...>
  # ivtmp.38_220 = PHI <...>
  # DEBUG c => NULL
  # DEBUG k => 0
  # DEBUG BEGIN_STMT
  # DEBUG BEGIN_STMT
  # DEBUG D#78 => D#79 * 8
  # DEBUG D#77 => x_33(D) + D#78
  _62 = (void *) ivtmp.38_220;
  vect_x_im_61.13_228 = MEM <vector(4) double> [(const double *)_62];
  vect_x_im_61.14_226 = MEM <vector(4) double> [(const double *)_62 + 32B];
  vect_x_re_55.15_225 = VEC_PERM_EXPR <...>;
  vect_x_re_55.23_209 = VEC_PERM_EXPR <...>;
  # DEBUG D#76 => *D#77
  # DEBUG x_re => D#76
  # DEBUG BEGIN_STMT
  # DEBUG D#74 => (long unsigned int) D#75
  # DEBUG D#73 => D#74 * 8
  # DEBUG D#72 => x_33(D) + D#73
  # DEBUG D#71 => *D#72
  # DEBUG x_im => D#71
  # DEBUG BEGIN_STMT
  # DEBUG D#70 => y_41(D) + D#78
  _59 = (void *) ivtmp.36_64;
  vect_y_re_63.9_235 = MEM <vector(4) double> [(double *)_59];
  vect_y_re_63.10_233 = MEM <vector(4) double> [(double *)_59 + 32B];
  vect__42.18_219 = .FMA (vect_x_im_61.13_228, _232, vect_y_re_63.10_233);
  vect_y_re_69.17_222 = .FNMA (vect_x_re_55.15_225, _231, vect_y_re_63.9_235);
  vect_y_re_69.25_206 = .FNMA (vect_x_re_55.23_209, _215, vect_y_re_69.17_222);
  vect_y_re_69.25_205 = .FNMA (_216, vect_x_im_61.14_226, vect__42.18_219);
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #22 from Michael_S ---
(In reply to Alexander Monakov from comment #21)
> (In reply to Michael_S from comment #19)
> > > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > > 'unlaminated' (turned to 2 uops before renaming), so selecting
> > > independent IVs for the two arrays actually helps on this testcase.
> >
> > Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd
> > 32(%rdx), %ymm3, %ymm0' would be turned into 2 uops.
>
> The difference is at which point in the pipeline. The latter goes through
> renaming as one fused uop.

Intel never documents such fine details in their Optimization Reference
manuals. But I believe you.

> > Misuse of load+op is a far bigger problem in this particular test case
> > than sub-optimal loop overhead. Assuming execution on Intel Skylake, it
> > turns a loop that can potentially run at 3 clocks per iteration into a
> > loop of 4+ clocks per iteration.
>
> Sorry, which assembler output does this refer to?

gcc12 -O3 -mavx2 -mfma

gcc12 -O3 -march=skylake does not suffer from this problem. I still think
that RISC-style icc code will be a little faster on Skylake, but here we are
arguing about 1/4th of a cycle per iteration rather than a full cycle.
https://godbolt.org/z/nfa7c9se3

> In bug 97127 the best explanation we have so far is that we don't optimally
> handle the case where non-memory inputs of an fma are reused, so we can't
> combine a load with an fma without causing an extra register copy (PR 97127
> comment 16 demonstrates what I mean). I cannot imagine such trouble arising
> with more common commutative operations like mul/add, especially with the
> non-destructive VEX encoding. If you hit such examples, I would suggest
> reporting them also, because their root cause might be different.
>
> In general load-op combining should be very helpful on x86, because it
> reduces the number of uops flowing through the renaming stage, which is one
> of the narrowest points in the pipeline.

If compilers were perfect, AVX load-op combining would be somewhat helpful.
I have my doubts about "very helpful". But compilers are not perfect. For
the non-AVX case, where every op is destructive and repeated loads are on
average cheaper than on AVX, combined load-op is far more profitable.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #21 from Alexander Monakov ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > IVs for the two arrays actually helps on this testcase.
>
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline. The latter goes through
renaming as one fused uop.

> Misuse of load+op is a far bigger problem in this particular test case than
> sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns a
> loop that can potentially run at 3 clocks per iteration into a loop of 4+
> clocks per iteration.

Sorry, which assembler output does this refer to?

> But I consider it a separate issue. I reported a similar issue in PR 97127,
> but here it is more serious. It looks to me that the issue is not soluble
> within the existing gcc optimization framework. The only chance is if you
> accept my old and simple advice - within inner loops, pretend that AVX is
> RISC, i.e. generate code as if the load-op forms of AVX instructions didn't
> exist.

In bug 97127 the best explanation we have so far is that we don't optimally
handle the case where non-memory inputs of an fma are reused, so we can't
combine a load with an fma without causing an extra register copy (PR 97127
comment 16 demonstrates what I mean). I cannot imagine such trouble arising
with more common commutative operations like mul/add, especially with the
non-destructive VEX encoding. If you hit such examples, I would suggest
reporting them also, because their root cause might be different.

In general load-op combining should be very helpful on x86, because it
reduces the number of uops flowing through the renaming stage, which is one
of the narrowest points in the pipeline.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #20 from Michael_S ---
(In reply to Richard Biener from comment #17)
> (In reply to Michael_S from comment #16)
> > On an unrelated note, why does the loop overhead use so many
> > instructions? Assuming that I am as misguided as gcc about load-op
> > combining, I would write it as:
> >   sub          %rax, %rdx
> > .L3:
> >   vmovupd      (%rdx,%rax), %ymm1
> >   vmovupd      32(%rdx,%rax), %ymm0
> >   vfmadd213pd  32(%rax), %ymm3, %ymm1
> >   vfnmadd213pd (%rax), %ymm2, %ymm0
> >   vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0
> >   vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
> >   vmovupd      %ymm0, (%rax)
> >   vmovupd      %ymm1, 32(%rax)
> >   addq         $64, %rax
> >   decl         %esi
> >   jb           .L3
> >
> > The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
> > vs 5 x86 instructions == 4 macro-ops in the gcc variant.
> > Also, in the gcc variant all memory accesses have a displacement that
> > makes them 1 byte longer. In my variant only half of the accesses have a
> > displacement.
> >
> > I think, in the past I have seen cases where gcc generates optimal or
> > near-optimal code sequences for loop overhead. I wonder why it cannot do
> > it here.
>
> I don't think we currently consider IVs based on the difference of two
> addresses.

It seems to me that I have seen you doing it. But maybe I confuse gcc with
clang.

> The cost benefit of no displacement is only size,

Size is pretty important in high-IPC SIMD loops, especially on Intel and
when the number of iterations is small, because Intel has 16-byte fetch out
of the L1I cache. SIMD instructions tend to be long, and not many
instructions fit within 16 bytes even when memory accesses have no offsets.
Offsets add insult to injury.

> otherwise
> I have no idea why we have biased the %rax accesses by -32. Why we
> fail to consider decrement-to-zero for the counter IV is probably because
> IVCANON would add such an IV, but the vectorizer replaces that and IVOPTs
> doesn't consider re-adding it.

Sorry, I have no idea about the meaning of IVCANON.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #19 from Michael_S ---
(In reply to Alexander Monakov from comment #18)
> The apparent 'bias' is introduced by instruction scheduling: haifa-sched
> lifts a +64 increment over memory accesses, transforming +0 and +32
> displacements to -64 and -32. Sometimes this helps a little bit even on
> modern x86 CPUs.

I don't think that it ever helps on Intel Sandy Bridge or later, or on AMD
Zen1 or later.

> Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> IVs for the two arrays actually helps on this testcase.

Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
%ymm3, %ymm0' would be turned into 2 uops.

Misuse of load+op is a far bigger problem in this particular test case than
sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns a
loop that can potentially run at 3 clocks per iteration into a loop of 4+
clocks per iteration.

But I consider it a separate issue. I reported a similar issue in PR 97127,
but here it is more serious. It looks to me that the issue is not soluble
within the existing gcc optimization framework. The only chance is if you
accept my old and simple advice - within inner loops, pretend that AVX is
RISC, i.e. generate code as if the load-op forms of AVX instructions didn't
exist.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #18 from Alexander Monakov ---
The apparent 'bias' is introduced by instruction scheduling: haifa-sched
lifts a +64 increment over memory accesses, transforming +0 and +32
displacements to -64 and -32. Sometimes this helps a little bit even on
modern x86 CPUs.

Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
'unlaminated' (turned to 2 uops before renaming), so selecting independent
IVs for the two arrays actually helps on this testcase.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #17 from Richard Biener ---
(In reply to Michael_S from comment #16)
> On an unrelated note, why does the loop overhead use so many instructions?
> Assuming that I am as misguided as gcc about load-op combining, I would
> write it as:
>   sub          %rax, %rdx
> .L3:
>   vmovupd      (%rdx,%rax), %ymm1
>   vmovupd      32(%rdx,%rax), %ymm0
>   vfmadd213pd  32(%rax), %ymm3, %ymm1
>   vfnmadd213pd (%rax), %ymm2, %ymm0
>   vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0
>   vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
>   vmovupd      %ymm0, (%rax)
>   vmovupd      %ymm1, 32(%rax)
>   addq         $64, %rax
>   decl         %esi
>   jb           .L3
>
> The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
> vs 5 x86 instructions == 4 macro-ops in the gcc variant.
> Also, in the gcc variant all memory accesses have a displacement that
> makes them 1 byte longer. In my variant only half of the accesses have a
> displacement.
>
> I think, in the past I have seen cases where gcc generates optimal or
> near-optimal code sequences for loop overhead. I wonder why it cannot do
> it here.

I don't think we currently consider IVs based on the difference of two
addresses. The cost benefit of no displacement is only size, otherwise
I have no idea why we have biased the %rax accesses by -32. Why we
fail to consider decrement-to-zero for the counter IV is probably because
IVCANON would add such an IV, but the vectorizer replaces that and IVOPTs
doesn't consider re-adding it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #16 from Michael_S ---
On an unrelated note, why does the loop overhead use so many instructions?
Assuming that I am as misguided as gcc about load-op combining, I would
write it as:

  sub          %rax, %rdx
.L3:
  vmovupd      (%rdx,%rax), %ymm1
  vmovupd      32(%rdx,%rax), %ymm0
  vfmadd213pd  32(%rax), %ymm3, %ymm1
  vfnmadd213pd (%rax), %ymm2, %ymm0
  vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0
  vfnmadd231pd (%rdx,%rax), %ymm2, %ymm1
  vmovupd      %ymm0, (%rax)
  vmovupd      %ymm1, 32(%rax)
  addq         $64, %rax
  decl         %esi
  jb           .L3

The loop overhead in my variant is 3 x86 instructions == 2 macro-ops,
vs 5 x86 instructions == 4 macro-ops in the gcc variant.
Also, in the gcc variant all memory accesses have a displacement that makes
them 1 byte longer. In my variant only half of the accesses have a
displacement.

I think, in the past I have seen cases where gcc generates optimal or
near-optimal code sequences for loop overhead. I wonder why it cannot do it
here.
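The address-difference trick in the asm above can also be written down in C.
A scalar sketch with hypothetical names (note that forming the byte
difference of pointers into two distinct arrays is not strictly portable ISO
C, though it is exactly what the asm does at the machine level):

  #include <stdint.h>

  void axpy_like (double *y, const double *x, double f, long n)
  {
    uintptr_t diff = (uintptr_t) x - (uintptr_t) y;  /* loop-invariant */
    double *p = y;
    for (long i = 0; i < n; ++i, ++p)
      {
        /* x[i] is reached as *(p + diff): one incremented pointer serves
           both arrays via base+index addressing.  */
        const double *xp = (const double *) ((uintptr_t) p + diff);
        *p -= f * *xp;
      }
  }

With this shape the loop carries one pointer IV plus the counter, mirroring
the 3-instruction overhead of the hand-written loop.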
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #15 from Richard Biener ---
I can confirm we get

.L3:
        vmovupd (%rsi), %ymm1
        vmovupd 32(%rsi), %ymm0
        addl    $1, %eax
        addq    $64, %rdi
        addq    $64, %rsi
        vblendpd        $14, %ymm1, %ymm0, %ymm3
        vblendpd        $14, %ymm0, %ymm1, %ymm2
        vfnmadd213pd    -64(%rdi), %ymm5, %ymm3
        vfmadd213pd     -32(%rdi), %ymm7, %ymm1
        vfnmadd132pd    %ymm4, %ymm3, %ymm2
        vfnmadd132pd    %ymm6, %ymm1, %ymm0
        vmovupd %ymm2, -64(%rdi)
        vmovupd %ymm0, -32(%rdi)
        cmpl    %edx, %eax
        jb      .L3

instead of

.L3:
        vmovupd (%rdx), %ymm1
        vmovupd (%rdx), %ymm0
        addl    $1, %ecx
        addq    $64, %rax
        vfmadd213pd     -32(%rax), %ymm3, %ymm1
        vfnmadd213pd    -64(%rax), %ymm2, %ymm0
        addq    $64, %rdx
        vfnmadd231pd    -32(%rdx), %ymm3, %ymm0
        vfnmadd231pd    -32(%rdx), %ymm2, %ymm1
        vmovupd %ymm0, -64(%rax)
        vmovupd %ymm1, -32(%rax)
        cmpl    %esi, %ecx
        jb      .L3

the good case sees

  <bb ...> [local count: 214748368]:
  # ivtmp.27_211 = PHI <...>
  # ivtmp.32_209 = PHI <...>
  # ivtmp.34_28 = PHI <...>
  _53 = (void *) ivtmp.34_28;
  vect_x_re_54.13_193 = MEM <vector(4) double> [(const double *)_53];
  vect_x_im_60.21_176 = MEM <vector(4) double> [(const double *)_53 + 32B];
  _54 = (void *) ivtmp.32_209;
  vect_y_re_62.9_200 = MEM <vector(4) double> [(double *)_54];
  vect_y_re_62.10_198 = MEM <vector(4) double> [(double *)_54 + 32B];
  vect__154.17_185 = .FMA (vect_x_re_54.13_193, _197, vect_y_re_62.10_198);
  vect__66.16_188 = .FNMA (vect_x_re_54.13_193, _196, vect_y_re_62.9_200);
  vect_y_re_68.23_173 = .FNMA (vect_x_im_60.21_176, _197, vect__66.16_188);
  vect_y_re_68.23_172 = .FNMA (vect_x_im_60.21_176, _196, vect__154.17_185);
  MEM <vector(4) double> [(double *)_54] = vect_y_re_68.23_173;
  MEM <vector(4) double> [(double *)_54 + 32B] = vect_y_re_68.23_172;
  ivtmp.27_210 = ivtmp.27_211 + 1;
  ivtmp.32_208 = ivtmp.32_209 + 64;
  ivtmp.34_51 = ivtmp.34_28 + 64;
  if (bnd.6_207 > ivtmp.27_210)
    goto <bb ...>; [90.00%]

while the bad case has

  <bb ...> [local count: 214748368]:
  # ivtmp.31_65 = PHI <...>
  # ivtmp.36_63 = PHI <...>
  # ivtmp.38_203 = PHI <...>
  _61 = (void *) ivtmp.38_203;
  vect_x_im_60.13_211 = MEM <vector(4) double> [(const double *)_61];
  vect_x_im_60.14_209 = MEM <vector(4) double> [(const double *)_61 + 32B];
  vect_x_re_54.15_208 = VEC_PERM_EXPR <...>;
  vect_x_re_54.23_192 = VEC_PERM_EXPR <...>;
  _58 = (void *) ivtmp.36_63;
  vect_y_re_62.9_218 = MEM <vector(4) double> [(double *)_58];
  vect_y_re_62.10_216 = MEM <vector(4) double> [(double *)_58 + 32B];
  vect__41.18_202 = .FMA (vect_x_im_60.13_211, _215, vect_y_re_62.10_216);
  vect_y_re_68.17_205 = .FNMA (vect_x_re_54.15_208, _214, vect_y_re_62.9_218);
  vect_y_re_68.25_189 = .FNMA (vect_x_re_54.23_192, _198, vect_y_re_68.17_205);
  vect_y_re_68.25_188 = .FNMA (_199, vect_x_im_60.14_209, vect__41.18_202);
  MEM <vector(4) double> [(double *)_58] = vect_y_re_68.25_189;
  MEM <vector(4) double> [(double *)_58 + 32B] = vect_y_re_68.25_188;
  ivtmp.31_64 = ivtmp.31_65 + 1;
  ivtmp.36_62 = ivtmp.36_63 + 64;
  ivtmp.38_59 = ivtmp.38_203 + 64;
  if (ivtmp.31_64 < bnd.6_225)
    goto <bb ...>; [90.00%]

the blends do not look like no-ops so I wonder if this is really computing
the same thing ... (it swaps lane 0 from the two loads from x but not the
stores)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #14 from Michael_S ---
I tested the smaller test bench from comment 3 with gcc trunk on godbolt.
The issue appears to be only partially fixed. The -Ofast result is no longer
the horror that it was before, but it is still not as good as -O3 or -O2.
-Ofast code generation is still strange, and there are a few vblendpd
instructions that serve no useful purpose.

And -O2/-O3 is still not as good as it should be, or as good as icc. But, as
mentioned in my original post, over-aggressive load+op combining is a
separate problem.
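The test bench itself lives in comment 3 and is not quoted in this excerpt.
For orientation, a minimal sketch of the loop shape under discussion; the
names, layout, and signature are a reconstruction from the GIMPLE in
comments #4 and #15 (per-lane it computes y -= conj(f) * x over an AoSoA
layout), not the original source:

  #define SIMD_WIDTH 4

  typedef struct
  {
    double re[SIMD_WIDTH];
    double im[SIMD_WIDTH];
  } cplx_vec;

  void
  caxpy_like (cplx_vec *restrict y, const cplx_vec *restrict x,
              double f_re, double f_im, int n)
  {
    for (int i = 0; i < n; ++i)
      for (int k = 0; k < SIMD_WIDTH; ++k)
        {
          double x_re = x[i].re[k], x_im = x[i].im[k];
          y[i].re[k] = y[i].re[k] - f_re * x_re - f_im * x_im;
          y[i].im[k] = y[i].im[k] + f_im * x_re - f_re * x_im;
        }
  }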
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Andrew Pinski changed:

           What    |Removed |Added
----------------------------------------------------------------------------
   Target Milestone|---     |12.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Richard Biener changed:

           What    |Removed  |Added
----------------------------------------------------------------------------
         Resolution|---      |FIXED
             Status|ASSIGNED |RESOLVED
      Known to work|         |12.0

--- Comment #13 from Richard Biener ---
Fixed (hopefully), for GCC 12.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #12 from CVS Commits ---
The master branch has been updated by Richard Biener <rguenther@suse.de>:

https://gcc.gnu.org/g:ce670e4faafb296d1f1a7828d20f8c8ba4686797

commit r12-1329-gce670e4faafb296d1f1a7828d20f8c8ba4686797
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Nov 18 14:17:34 2020 +0100

    tree-optimization/97832 - handle associatable chains in SLP discovery

    This makes SLP discovery handle associatable (including mixed
    plus/minus) chains better by swapping operands across the whole chain.
    To make this work, it adds caching of the 'matches' lanes for failed
    SLP discovery attempts, thereby fixing a failed SLP discovery for the
    slp-pr98855.cc testcase, which now results in building an operand from
    scalars as expected. Unfortunately this makes us trip over the cost
    threshold, so I'm XFAILing the testcase for now.

    For BB vectorization all this doesn't work because we have no way to
    distinguish good from bad associations, as we eventually build operands
    from scalars and thus do not fail in the classical sense.

    2021-05-31  Richard Biener

            PR tree-optimization/97832
            * tree-vectorizer.h (_slp_tree::failed): New.
            * tree-vect-slp.c (_slp_tree::_slp_tree): Initialize failed
            member.
            (_slp_tree::~_slp_tree): Free failed.
            (vect_build_slp_tree): Retain failed nodes and record matches
            in them, copying that back out when running into a cached fail.
            Dump start and end of discovery.
            (dt_sort_cmp): New.
            (vect_build_slp_tree_2): Handle associatable chains together
            doing more aggressive operand swapping.

            * gcc.dg/vect/pr97832-1.c: New testcase.
            * gcc.dg/vect/pr97832-2.c: Likewise.
            * gcc.dg/vect/pr97832-3.c: Likewise.
            * g++.dg/vect/slp-pr98855.cc: XFAIL.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #11 from Richard Biener ---
(In reply to Michael_S from comment #10)
> I lost track of what you're talking about a long time ago.
> But that's o.k.

No problem - difficult PRs tend to be used as media to brain-dump and record
work progress.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #10 from Michael_S ---
I lost track of what you're talking about a long time ago.
But that's o.k.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #9 from Richard Biener ---
There's then also a permute optimization left on the plate:

t.c:16:3: note:   node 0x3a19590 (max_nunits=4, refcnt=2)
t.c:16:3: note:   stmt 0 _153 = f11_im_76 * x1_im_142;
t.c:16:3: note:   stmt 1 _213 = f11_re_72 * x1_re_202;
t.c:16:3: note:   stmt 2 _275 = f11_re_72 * x1_re_264;
t.c:16:3: note:   stmt 3 _337 = f11_re_72 * x1_re_326;
t.c:16:3: note:   stmt 4 _155 = f11_im_76 * x1_re_140;
t.c:16:3: note:   stmt 5 _217 = f11_im_76 * x1_re_202;
t.c:16:3: note:   stmt 6 _279 = f11_im_76 * x1_re_264;
t.c:16:3: note:   stmt 7 _341 = f11_im_76 * x1_re_326;
t.c:16:3: note:   children 0x3a19600 0x3a19670
t.c:16:3: note:   node (external) 0x3a19600 (max_nunits=1, refcnt=1)
t.c:16:3: note:   { f11_im_76, f11_re_72, f11_re_72, f11_re_72, f11_im_76, f11_im_76, f11_im_76, f11_im_76 }
t.c:16:3: note:   node 0x3a19670 (max_nunits=4, refcnt=1)
t.c:16:3: note:   stmt 0 x1_im_142 = *_141;
t.c:16:3: note:   stmt 1 x1_re_202 = *_201;
t.c:16:3: note:   stmt 2 x1_re_264 = *_263;
t.c:16:3: note:   stmt 3 x1_re_326 = *_325;
t.c:16:3: note:   stmt 4 x1_re_140 = *_139;
t.c:16:3: note:   stmt 5 x1_re_202 = *_201;
t.c:16:3: note:   stmt 6 x1_re_264 = *_263;
t.c:16:3: note:   stmt 7 x1_re_326 = *_325;
t.c:16:3: note:   load permutation { 4 1 2 3 0 1 2 3 }

which we currently do not handle (there's a FIXME as to permuting externals;
currently we only handle splats as transparent for permutes).
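The optimization the FIXME asks for rests on permutes commuting with
elementwise operations: a load permutation can be folded into an external
operand simply by reordering the external's constructor. A minimal scalar
check of that identity (the values here are made up):

  #include <assert.h>

  int main (void)
  {
    enum { N = 8 };
    const int perm[N] = { 4, 1, 2, 3, 0, 1, 2, 3 };  /* from the dump above */
    const double a[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    const double b[N] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    double prod[N];
    for (int i = 0; i < N; ++i)
      prod[i] = a[i] * b[i];
    /* permute-then-multiply equals multiply-then-permute, lane by lane */
    for (int i = 0; i < N; ++i)
      assert (a[perm[i]] * b[perm[i]] == prod[perm[i]]);
    return 0;
  }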
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #8 from Richard Biener ---
Created attachment 49586
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49586&action=edit
prototype

This is a prototype patch which can serve as proof-of-concept. It needs
cleanup plus better handling of hybrid SLP discovery. It depends on
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559347.html to fix
the testcase in this PR (which is included in the patch).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #7 from Richard Biener ---
Or

double a[1024], b[1024], c[1024];
void foo()
{
  for (int i = 0; i < 256; ++i)
    {
      a[2*i] = 1. - a[2*i] + b[2*i];
      a[2*i+1] = 1 + a[2*i+1] - b[2*i+1];
    }
}

which early folding breaks unless we add -fno-associative-math. We then end
up with

  a[_1] = (((b[_1]) - (a[_1])) + 1.0e+0);
  a[_6] = (((a[_6]) - (b[_6])) + 1.0e+0);

where SLP operand swapping cannot manage to bring the grouped loads into the
same lanes (lane 0 subtracts a from b, lane 1 subtracts b from a).

So the idea is to look at single-use chains of plus/minus operations and
handle those as wide associated SLP nodes with flags denoting which lanes
need negation. We'd have three children, and each child would have a
per-lane spec of whether to add or subtract.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #6 from Richard Biener ---
So for example we'd like to vectorize with SLP when reassociation is
permitted (thus with -Ofast for example):

double a[1024], b[1024], c[1024];
void foo()
{
  for (int i = 0; i < 256; ++i)
    {
      a[2*i] = 1. - a[2*i] + b[2*i];
      a[2*i+1] = a[2*i+1] + b[2*i+1] + 1.;
    }
}

it again works when written as follows and with -fno-tree-reassoc:

double a[1024], b[1024], c[1024];
void foo()
{
  for (int i = 0; i < 256; ++i)
    {
      a[2*i] = 1. - a[2*i] + b[2*i];
      a[2*i+1] = 1 + a[2*i+1] + b[2*i+1];
    }
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #5 from Richard Biener ---
OK, so I have a patch to keep the association linear, which IMHO is good. It
fixes the smaller testcase and mine, but not the original one, which is now
linear but still not homogeneous. The store groups are as follows:

  *_115 = (((((*_115) - (f00_re_68 * x0_re_108)) - (f10_re_70 * x1_re_140)) - (f00_im_73 * x0_im_114)) - (f10_im_74 * x1_im_142));
  *_117 = (((((*_117) + (f00_im_73 * x0_re_108)) + (f10_im_74 * x1_re_140)) - (f00_re_68 * x0_im_114)) - (f10_re_70 * x1_im_142));
  *_119 = (((((*_119) - (f01_re_71 * x0_re_108)) - (f11_re_72 * x1_re_140)) - (f01_im_75 * x0_im_114)) - (f11_im_76 * x1_im_142));
  *_121 = (((((*_121) + (f01_im_75 * x0_re_108)) + (f11_im_76 * x1_re_140)) - (f01_re_71 * x0_im_114)) - (f11_re_72 * x1_im_142));

(good)

  *_177 = (((((*_177) - (f00_re_68 * x0_re_170)) - (f00_im_73 * x0_im_176)) - (f10_re_70 * x1_re_202)) - (f10_im_74 * x1_im_204));
  *_179 = (((((f00_im_73 * x0_re_170) + (f10_im_74 * x1_re_202)) + (*_179)) - (f00_re_68 * x0_im_176)) - (f10_re_70 * x1_im_204));
  *_181 = (((((*_181) - (f01_re_71 * x0_re_170)) - (f01_im_75 * x0_im_176)) - (f11_re_72 * x1_re_202)) - (f11_im_76 * x1_im_204));
  *_183 = (((((f01_im_75 * x0_re_170) + (f11_im_76 * x1_re_202)) + (*_183)) - (f01_re_71 * x0_im_176)) - (f11_re_72 * x1_im_204));

already bad.

Now, this is something to tackle in the vectorizer, which ideally should not
try to match up individual adds during SLP discovery but instead (if
association is allowed) consider the whole addition chain, commutating
within the whole chain rather than just swapping individual add operands.

I still think the reassoc change I came up with is good since it avoids the
need to linearize in the vectorizer. So testing that now.
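To see what "linear but not homogeneous" means, here is one lane from each
group with the four products abbreviated to a, b, c, d (a scalar sketch;
both shapes compute y + a + b - c - d, only the association differs):

  /* Lane *_117 of the good group: the accumulator enters the chain
     first, like every other lane of its group.  */
  double lane_good (double y, double a, double b, double c, double d)
  {
    return (((y + a) + b) - c) - d;
  }

  /* Lane *_179 of the bad group: the accumulator is buried mid-chain,
     so this lane's shape does not match the other lanes of its group.  */
  double lane_bad (double y, double a, double b, double c, double d)
  {
    return (((a + b) + y) - c) - d;
  }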
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

Richard Biener changed:

           What    |Removed |Added
----------------------------------------------------------------------------
          Component|target  |tree-optimization

--- Comment #4 from Richard Biener ---
Ah, thanks - that helps. So we're re-associating from

  *_89 = (((*_89) - (f_re_34 * x_re_82)) - (f_im_35 * x_im_88));
  *_91 = (((*_91) + (f_im_35 * x_re_82)) - (f_re_34 * x_im_88));

to

  *_89 = ((*_89) - ((f_re_34 * x_re_82) + (f_im_35 * x_im_88)));
  *_91 = (((*_91) + (f_im_35 * x_re_82)) - (f_re_34 * x_im_88));

and that makes the operations unbalanced. This is (a - b) - c -> a - (b + c),
as we're optimizing this as a + -b + -c. Even smaller testcase:

double a[1024], b[1024], c[1024];
void foo()
{
  for (int i = 0; i < 256; ++i)
    {
      a[2*i] = a[2*i] + b[2*i] - c[2*i];
      a[2*i+1] = a[2*i+1] - b[2*i+1] - c[2*i+1];
    }
}

here ranks end up associating the expression as (-b + -c) + a, and negate
re-propagation goes (-b - c) + a -> -(b + c) + a -> a - (b + c), which is
all sensible in isolation. You could say that associating as (-b + -c) + a
is worse than (a + -b) + -c in this respect. The ranks are

  Rank for _8 is 327683   (a)
  Rank for _13 is 327684  (-b)
  Rank for _21 is 327684  (-c)

where the rank is one higher for the negated values because of the negate
operation. While heuristically ignoring negates for rank propagation to make
all ranks equal helps this new testcase, it doesn't help for the larger two.
It might still be a generally sound heuristic improvement though.

For the effects on vectorization I think we need to do something in the
vectorizer itself, for example linearizing expressions. The first reassoc
pass is supposed to do this, but then negate re-propagation undoes it in
this case - which maybe points at negate re-propagation needing a fix,
somehow associating a not-negated operand first.
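Traced in scalar C, the rewrite sequence described above looks like this
(the function names are made up for illustration; all three forms compute
a - b - c):

  double as_written (double a, double b, double c)
  {
    return (a + -b) + -c;         /* how reassoc views a - b - c */
  }

  double after_ranks (double a, double b, double c)
  {
    return (-b + -c) + a;         /* ranks group the negated operands */
  }

  double after_negate_prop (double a, double b, double c)
  {
    return a - (b + c);           /* -(b + c) + a after re-propagation */
  }

The last form associates b with c, while the neighbouring lane computes
(a + b) - c, so the grouped loads of b and c no longer line up across lanes.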