[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Bug 84037 depends on bug 85491, which changed state.

Bug 85491 Summary: [8 Regression] nbench LU Decomposition test 15% slower than
GCC 7, 30% slower than peak
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85491

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #33 from Richard Biener ---
Author: rguenth
Date: Fri Feb 16 13:47:25 2018
New Revision: 257734

URL: https://gcc.gnu.org/viewcvs?rev=257734&root=gcc&view=rev
Log:
2018-02-16  Richard Biener

        PR tree-optimization/84037
        PR tree-optimization/84016
        PR target/82862
        * config/i386/i386.c (ix86_builtin_vectorization_cost): Adjust
        vec_construct for the fact we need additional higher latency 128bit
        inserts for AVX256 and AVX512 vector builds.
        (ix86_add_stmt_cost): Scale vector construction cost for
        elementwise loads.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/config/i386/i386.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #32 from Richard Biener ---
Fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #31 from Richard Biener ---
(In reply to Jakub Jelinek from comment #30)
> Is this fixed now or is there more work to do?

I'm performance testing some more fixes.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #30 from Jakub Jelinek --- Is this fixed now or is there more work to do?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #29 from Richard Biener ---
Author: rguenth
Date: Mon Feb 12 13:55:04 2018
New Revision: 257588

URL: https://gcc.gnu.org/viewcvs?rev=257588&root=gcc&view=rev
Log:
2018-02-12  Richard Biener

        PR tree-optimization/84037
        * tree-vect-slp.c (vect_analyze_slp_cost): Add visited
        parameter, move visited init to caller.
        (vect_slp_analyze_operations): Separate cost from validity
        check, initialize visited once for all instances.
        (vect_schedule_slp): Analyze map to CSE vectorized nodes once
        for all instances.
        * tree-vect-stmts.c (vect_model_simple_cost): Make early out
        an assert.
        (vect_model_promotion_demotion_cost): Likewise.
        (vectorizable_bswap): Guard cost modeling with !slp_node
        instead of !PURE_SLP_STMT to avoid double-counting on hybrid
        SLP stmts.
        (vectorizable_call): Likewise.
        (vectorizable_conversion): Likewise.
        (vectorizable_assignment): Likewise.
        (vectorizable_shift): Likewise.
        (vectorizable_operation): Likewise.
        (vectorizable_store): Likewise.
        (vectorizable_load): Likewise.
        (vectorizable_condition): Likewise.
        (vectorizable_comparison): Likewise.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/tree-vect-slp.c
        trunk/gcc/tree-vect-stmts.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #28 from Richard Biener ---
Author: rguenth
Date: Mon Feb 12 08:54:28 2018
New Revision: 257581

URL: https://gcc.gnu.org/viewcvs?rev=257581&root=gcc&view=rev
Log:
2018-02-12  Richard Biener

        PR tree-optimization/84037
        * tree-vect-slp.c (vect_build_slp_tree_2): Try swapping the
        matched stmts if we cannot swap the non-matched ones.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/tree-vect-slp.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #27 from Richard Biener --- (In reply to amker from comment #26) > (In reply to amker from comment #25) > > I tend to believe this is an register pressure based strength-reduction + > > lim problem than ivopts. > > > > So given class of memory references like: > > > > reg = ... > > Loop: > > MEM[iv_base + reg * 0]; > > MEM[iv_base + reg * 1]; > > MEM[iv_base + reg * 2]; > > MEM[iv_base + reg * 3]; > > MEM[iv_base + reg * 4]; > > MEM[iv_base + reg * 5]; > > MEM[iv_base + reg * 6]; > > MEM[iv_base + reg * 7]; > > > > The best arrangement probably would be: > > > > reg = ... > > regX = reg * 3; > > Loop: > > MEM[iv_base + reg * 0]; > > MEM[iv_base + reg * 1]; > > MEM[iv_base + reg * 2]; > > t = iv_base + regX; > > MEM[t]; > > MEM[iv_base + reg * 4]; > > MEM[t + reg * 2]; > > MEM[iv_base + regX * 2]; > > MEM[t + reg * 4]; > > > > Depending on the register pressure, regX should be re-materialized in loop > > or hoisted as invariant. > > Of course supported scales in addressing modes should be considered, but I > guess most targets only (if) support [base + reg * 2^x]? Yes. Note that the input to IVOPTs is the strength-reduced variant and IVOPTs generates the variant with multiple hoisted invariants. The IVOPTs issue I mentioned happens when you compare the following two functions - for foo it ends up with two IVs (good). 
float A[1024];
float B[1024];

void foo(int s, int r)
{
  for (int i = 0; i < 128; i+=4)
    {
      B[(i+0)*r + 0] = A[(i+0)*s + 0];
      B[(i+0)*r + 1] = A[(i+0)*s + 1];
      B[(i+1)*r + 0] = A[(i+1)*s + 0];
      B[(i+1)*r + 1] = A[(i+1)*s + 1];
      B[(i+2)*r + 0] = A[(i+2)*s + 0];
      B[(i+2)*r + 1] = A[(i+2)*s + 1];
      B[(i+3)*r + 0] = A[(i+3)*s + 0];
      B[(i+3)*r + 1] = A[(i+3)*s + 1];
    }
}

void bar(int s, int r)
{
  for (int i = 0; i < 128; i+=4)
    {
      float *b = &B[i*r];
      float *a = &A[i*s];
      b[0] = a[0]; b[1] = a[1]; b += r; a += s;
      b[0] = a[0]; b[1] = a[1]; b += r; a += s;
      b[0] = a[0]; b[1] = a[1]; b += r; a += s;
      b[0] = a[0]; b[1] = a[1];
    }
}

The vectorizer actually emits sth like the following but it behaves like bar.

void baz(int s, int r)
{
  float *IVb = &B[0];
  float *IVa = &A[0];
  for (int i = 0; i < 128; i+=4)
    {
      float *b = IVb;
      float *a = IVa;
      b[0] = a[0]; b[1] = a[1]; b += r; a += s;
      b[0] = a[0]; b[1] = a[1]; b += r; a += s;
      b[0] = a[0]; b[1] = a[1]; b += r; a += s;
      b[0] = a[0]; b[1] = a[1];
      IVb += 4*r;
      IVa += 4*s;
    }
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #26 from amker at gcc dot gnu.org --- (In reply to amker from comment #25) > I tend to believe this is an register pressure based strength-reduction + > lim problem than ivopts. > > So given class of memory references like: > > reg = ... > Loop: > MEM[iv_base + reg * 0]; > MEM[iv_base + reg * 1]; > MEM[iv_base + reg * 2]; > MEM[iv_base + reg * 3]; > MEM[iv_base + reg * 4]; > MEM[iv_base + reg * 5]; > MEM[iv_base + reg * 6]; > MEM[iv_base + reg * 7]; > > The best arrangement probably would be: > > reg = ... > regX = reg * 3; > Loop: > MEM[iv_base + reg * 0]; > MEM[iv_base + reg * 1]; > MEM[iv_base + reg * 2]; > t = iv_base + regX; > MEM[t]; > MEM[iv_base + reg * 4]; > MEM[t + reg * 2]; > MEM[iv_base + regX * 2]; > MEM[t + reg * 4]; > > Depending on the register pressure, regX should be re-materialized in loop > or hoisted as invariant. Of course supported scales in addressing modes should be considered, but I guess most targets only (if) support [base + reg * 2^x]?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #25 from amker at gcc dot gnu.org ---
I tend to believe this is a register-pressure-based strength-reduction + lim
problem rather than an ivopts one.

So given a class of memory references like:

  reg = ...
Loop:
  MEM[iv_base + reg * 0];
  MEM[iv_base + reg * 1];
  MEM[iv_base + reg * 2];
  MEM[iv_base + reg * 3];
  MEM[iv_base + reg * 4];
  MEM[iv_base + reg * 5];
  MEM[iv_base + reg * 6];
  MEM[iv_base + reg * 7];

the best arrangement probably would be:

  reg = ...
  regX = reg * 3;
Loop:
  MEM[iv_base + reg * 0];
  MEM[iv_base + reg * 1];
  MEM[iv_base + reg * 2];
  t = iv_base + regX;
  MEM[t];
  MEM[iv_base + reg * 4];
  MEM[t + reg * 2];
  MEM[iv_base + regX * 2];
  MEM[t + reg * 4];

Depending on the register pressure, regX should be re-materialized in the loop
or hoisted as an invariant.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Bug 84037 depends on bug 84278, which changed state.

Bug 84278 Summary: claims initv4sfv2sf is available but inits through stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84278

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #24 from Richard Biener --- (In reply to amker from comment #23) > (In reply to Richard Biener from comment #21) > > So after r257453 we improve the situation pre-IVOPTs to just > > 6 IVs (duplicated but trivially equivalent) plus one counting IV. But then > > when SLP is enabled IVOPTs comes along and adds another 4 IVs which makes us > > spill... (for AVX256, so you need -march=core-avx2 for example). > > > > Bin, any chance you can take a look? In the IVO dump I see > > > > target_avail_regs 15 > > target_clobbered_regs 9 > > target_reg_cost 4 > > target_spill_cost 8 > > regs_used 3 > > ^^^ > > > > and regs_used looks awfully low to me. The loop has even more IVs initially > > plus variable steps for that IVs which means we need two regs per IV. > > > > There doesn't seem to be a way to force IVOPTs to use the minimal set of > > IVs? > > Or just use the original set, removing the obvious redundancies? There is > > a microarchitectural issue left with the vectorization but the spilling > > obscures the look quite a bit :/ > > Sure, I will have a look based on your commit. Thanks Note the loop in question is the one starting at line 551, it gets inlined multiple times but the issue is visible with -fno-inline as well. -mavx2 makes things worse (compared to -mavx2 -mprefer-avx128) because for the strided accesses we choose to compute extra invariants for the two strides of A and E. For SSE we keep stride and stride * 3 while for AVX we additionally compute stride * 5, stride * 6 and stride * 7 (in the cases we don't choose another base IV). At least computing stride * 6 can be avoided by using stride * 3 with step 2 - but it's probably too hard to see that within the current IVO model? I'm not sure avoiding an invariant in exchange for an extra IV is ever a good idea? 
Spilling an invariant should be cheaper than spilling an IV - but yes, the
addressing mode possibly looks to offset for any bias we apply there.

Note the vectorizer itself tries to avoid computing stride * N by
strength-reducing it:

  _711 = (sizetype) iftmp.472_91;
  _712 = _711 * 64;
  _715 = (sizetype) iftmp.472_91;
  _716 = _715 * 8;
...
  # ivtmp_891 = PHI <...>
...
  _893 = MEM[(real(kind=4) *)ivtmp_891];
  ivtmp_894 = ivtmp_891 + _716;
  _895 = MEM[(real(kind=4) *)ivtmp_894];
  ivtmp_896 = ivtmp_894 + _716;
  _897 = MEM[(real(kind=4) *)ivtmp_896];
  ivtmp_898 = ivtmp_896 + _716;
  _899 = MEM[(real(kind=4) *)ivtmp_898];
  ivtmp_900 = ivtmp_898 + _716;
  _901 = MEM[(real(kind=4) *)ivtmp_900];
  ivtmp_902 = ivtmp_900 + _716;
  _903 = MEM[(real(kind=4) *)ivtmp_902];
  ivtmp_904 = ivtmp_902 + _716;
  _905 = MEM[(real(kind=4) *)ivtmp_904];
  ivtmp_906 = ivtmp_904 + _716;
  _907 = MEM[(real(kind=4) *)ivtmp_906];
  vect_cst__909 = {_893, _895, _897, _899, _901, _903, _905, _907};
...
  ivtmp_892 = ivtmp_891 + _712;

note how it advances the IV in one step at the end though. Not sure if IVO is
confused by that or by the way we compute _716 vs. _712.

That said, the summary is that IVO behavior with unrolled loop bodies with
variable stride isn't helping here ;)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #23 from amker at gcc dot gnu.org --- (In reply to Richard Biener from comment #21) > So after r257453 we improve the situation pre-IVOPTs to just > 6 IVs (duplicated but trivially equivalent) plus one counting IV. But then > when SLP is enabled IVOPTs comes along and adds another 4 IVs which makes us > spill... (for AVX256, so you need -march=core-avx2 for example). > > Bin, any chance you can take a look? In the IVO dump I see > > target_avail_regs 15 > target_clobbered_regs 9 > target_reg_cost 4 > target_spill_cost 8 > regs_used 3 > ^^^ > > and regs_used looks awfully low to me. The loop has even more IVs initially > plus variable steps for that IVs which means we need two regs per IV. > > There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs? > Or just use the original set, removing the obvious redundancies? There is > a microarchitectural issue left with the vectorization but the spilling > obscures the look quite a bit :/ Sure, I will have a look based on your commit. Thanks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #22 from Richard Biener ---
Author: rguenth
Date: Wed Feb 7 15:46:17 2018
New Revision: 257453

URL: https://gcc.gnu.org/viewcvs?rev=257453&root=gcc&view=rev
Log:
2018-02-07  Richard Biener

        PR tree-optimization/84037
        * tree-vectorizer.h (struct _loop_vec_info): Add ivexpr_map member.
        (cse_and_gimplify_to_preheader): Declare.
        (vect_get_place_in_interleaving_chain): Likewise.
        * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
        ivexpr_map.
        (_loop_vec_info::~_loop_vec_info): Delete it.
        (cse_and_gimplify_to_preheader): New function.
        * tree-vect-slp.c (vect_get_place_in_interleaving_chain): Export.
        * tree-vect-stmts.c (vectorizable_store): CSE base and steps.
        (vectorizable_load): Likewise.  For grouped stores always base
        the IV on the first element.
        * tree-vect-loop-manip.c (vect_loop_versioning): Unshare versioning
        condition before gimplifying.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/tree-vect-loop-manip.c
        trunk/gcc/tree-vect-loop.c
        trunk/gcc/tree-vect-slp.c
        trunk/gcc/tree-vect-stmts.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #21 from Richard Biener ---
So after r257453 we improve the situation pre-IVOPTs to just 6 IVs (duplicated
but trivially equivalent) plus one counting IV. But then when SLP is enabled
IVOPTs comes along and adds another 4 IVs which makes us spill... (for AVX256,
so you need -march=core-avx2 for example).

Bin, any chance you can take a look? In the IVO dump I see

  target_avail_regs 15
  target_clobbered_regs 9
  target_reg_cost 4
  target_spill_cost 8
  regs_used 3
  ^^^

and regs_used looks awfully low to me. The loop has even more IVs initially
plus variable steps for those IVs which means we need two regs per IV.

There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs?
Or just use the original set, removing the obvious redundancies? There is a
microarchitectural issue left with the vectorization but the spilling obscures
the look quite a bit :/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #20 from Richard Biener --- Note that targets already have the opportunity to limit vectorization by adjusting their finish_cost hook - here they even have more useful information available (kind of).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #19 from Richard Biener --- On Zen I measure 23s with --param vect-max-version-for-alias-checks=0 (thus basically before the rev.) and 33s without. With the patch and the size parameter tuned to 146 I get 25s and with 90 it is 22.5s. So Zen seems to be a bit pickier (code-size wise) than Broadwell.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #18 from Richard Biener ---
(In reply to Jan Hubicka from comment #17)
> We already have
> /* This function adjusts the unroll factor based on
>    the hardware capabilities. For ex, bdver3 has
>    a loop buffer which makes unrolling of smaller
>    loops less important. This function decides the
>    unroll factor using number of memory references
>    (value 32 is used) as a heuristic. */
>
> static unsigned
> ix86_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
>
> which triggers with TARGET_ADJUST_UNROLL
> /* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based
>    on hardware capabilities. Bdver3 hardware has a loop buffer which makes
>    unrolling small loop less important. For, such architectures we adjust
>    the unroll factor so that the unrolled loop fits the loop buffer. */
>
> DEF_TUNE (X86_TUNE_ADJUST_UNROLL, "adjust_unroll_factor", m_BDVER3 |
> m_BDVER4)
>
> so perhaps what you propose can be done by making this one more general?

Hmm. Both i386 and s390x expect the function to be in RTL form for this hook
so it doesn't apply on GIMPLE (and gimple unrolling). If we'd make it more
general, thus handle both IL forms, and have it return a maximum growth factor
instead of an unroll factor, not a maximum size (to avoid comparing apples
(size unit of target) and oranges (size unit of consumer)), then it could
indeed work. So sth like

  /* Return the maximum percentage LOOP is allowed to grow to
     or -1U for no target specific constraints. */
  unsigned targetm.loop_growth_constraint (loop *loop);

The existing uses in loop-unroll.c

  if (targetm.loop_unroll_adjust)
    nunroll = targetm.loop_unroll_adjust (nunroll, loop);

would then become sth like

  if (targetm.loop_unroll_adjust)
    nunroll = MIN (nunroll, targetm.loop_unroll_adjust (loop) / 100);

and existing hooks would have an early out

  if (in_gimple_form)
    return -1U;

with their current return value simply multiplied by 100?
Note the current x86 implementation does

  /* Count the number of memory references within the loop body.
     This value determines the unrolling factor for bdver3 and
     bdver4 architectures. */
  ...
  if (mem_count && mem_count <= 32)
    return 32 / mem_count;
  return nunroll;

and thus returns 32 for mem_count == 1 regardless of the nunroll specified by
the caller (like if it specified 2 because of some user --param). That looks
bogus to me and we instead want to return MIN (nunroll, 32 / mem_count)? The
s390 implementation gets this right.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #17 from Jan Hubicka ---
We already have

  /* This function adjusts the unroll factor based on
     the hardware capabilities. For ex, bdver3 has
     a loop buffer which makes unrolling of smaller
     loops less important. This function decides the
     unroll factor using number of memory references
     (value 32 is used) as a heuristic. */

  static unsigned
  ix86_loop_unroll_adjust (unsigned nunroll, struct loop *loop)

which triggers with TARGET_ADJUST_UNROLL

  /* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based
     on hardware capabilities. Bdver3 hardware has a loop buffer which makes
     unrolling small loop less important. For, such architectures we adjust
     the unroll factor so that the unrolled loop fits the loop buffer. */

  DEF_TUNE (X86_TUNE_ADJUST_UNROLL, "adjust_unroll_factor", m_BDVER3 | m_BDVER4)

so perhaps what you propose can be done by making this one more general?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #16 from Richard Biener ---
So discussion led to the proposal to add another unroll parameter, for example
--param small-loop-size, which serves as a "barrier" we may not cross when
optimizing a loop. Thus for all loops <= small-loop-size before the transform,
inhibit the transform from growing the loop to > small-loop-size. Limit other
loops like before. The motivation is to model things like the loop stream
detector or uop caches where falling out of either has a severe performance
impact. That's without trying to combine this new heuristic with the computed
speedup by unrolling/vectorization.

The default of the parameter would be zero (disable that feature) and targets
could set it based on their micro-architecture and benchmarking. It should
_always_ be smaller than --param max-unrolled-insns.

Sounds sane?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #15 from Richard Biener ---
Oh, and if you don't disable inlining then you get down to sizes of 148 (SSE
and SLP) and 91 and 75 (SSE and no SLP). So you won't get rid of two instances
of vectorization regardless of the parameter (for size 75 I don't apply
PARAM_MAX_UNROLLED_INSNS because it's not at least one full unroll copy when
looking at the scalar body size of 43).

With the default param we inhibit use of SLP in

capacita2.f90:226:0: note: estimated vector body size is 19, scalar body size 2
capacita2.f90:226:0: note: not vectorized: loop grows too much.
capacita2.f90:226:0: note: estimated vector body size is 11, scalar body size 2
capacita2.f90:226:0: note: loop vectorized
capacita2.f90:259:0: note: estimated vector body size is 19, scalar body size 2
capacita2.f90:259:0: note: not vectorized: loop grows too much.
capacita2.f90:259:0: note: estimated vector body size is 11, scalar body size 2
capacita2.f90:259:0: note: loop vectorized

in addition to the critical loop (copies) at 551:

capacita2.f90:551:0: note: estimated vector body size is 298, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 259, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 147, scalar body size 43
capacita2.f90:551:0: note: loop vectorized
capacita2.f90:551:0: note: estimated vector body size is 259, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 147, scalar body size 43
capacita2.f90:551:0: note: loop vectorized
capacita2.f90:551:0: note: estimated vector body size is 258, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 259, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 168, scalar body size 43
capacita2.f90:551:0: note: loop vectorized

I do think that applying this sort of heuristic makes sense, even if it
doesn't help the polyhedron case. Numbers for different values of the
parameter are

  300 (w/o patch)  20.91user 0.05system 0:20.97elapsed
  200 (default)    20.98user 0.08system 0:21.07elapsed 99%CPU
  147              19.62user 0.06system 0:19.70elapsed 99%CPU
  146              17.27user 0.08system 0:17.36elapsed 99%CPU
  140              17.19user 0.06system 0:17.26elapsed 99%CPU
  91               17.41user 0.05system 0:17.48elapsed 99%CPU
  90               16.98user 0.05system 0:17.04elapsed 99%CPU
  75               17.01user 0.04system 0:17.06elapsed 99%CPU
  74               16.93user 0.06system 0:16.99elapsed 99%CPU
  1                17.02user 0.06system 0:17.08elapsed 100%CPU

the sweet spot for this benchmark seems to be 146... For reference, with
-fno-tree-vectorize I get

  18.36user 0.09system 0:18.45elapsed 99%CPU

and with --param vect-max-version-for-alias-checks=0

  16.92user 0.06system 0:16.99elapsed 99%CPU
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #14 from Richard Biener ---
Created attachment 43289
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43289&action=edit
patch limiting growth

So I played with a simple hack limiting the amount of growth in a vectorized
loop based on our existing unroller parameters. This causes AVX256
vectorization to be rejected on that round with an estimated scalar loop body
size of 43 and a vector loop body size of 289 (with SLP and AVX256) or 259
(without SLP and AVX256). But we then still accept vectorization with SSE
(vector size with SLP 147). I simply count the number of stmts costed for the
estimate and added

  if (vect_body_size / scalar_size > PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)
      || ((vect_body_size / scalar_size > 1)
          && (vect_body_size > PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS))))
    {
      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                       "not vectorized: loop grows too much.\n");
      return -1;
    }

where PARAM_MAX_UNROLLED_INSNS is 200 at the moment. With SSE vectorization
the LSD.UOPS counter still doesn't trigger on the loop so it's still too big
(it's 958 bytes, the AVX2 variant is 1365 bytes, the scalar loop is already
363 bytes). So to fix the regression we'd need to lower the unroll parameter
to 146.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #13 from Richard Biener ---
So performance counters on my Broadwell machine say that there are zero hits
for this vectorized loop with LSD.UOPS while there are very many for the
scalar case. This means the loop body is too large to trigger the LSD (and
probably also to fit the uop cache).

One of the issues is that we require 4 registers for the indexes into the
loads:

        vmovq   (%rax,%r11,2), %xmm7
        vpinsrq $1, (%rax,%r13), %xmm7, %xmm4
        vmovq   (%rax), %xmm7
        vpinsrq $1, (%rax,%r11), %xmm7, %xmm9
        vmovq   (%rax,%r15), %xmm7
        vpinsrq $1, (%rax,%r12), %xmm7, %xmm3
        vmovq   (%rax,%r11,4), %xmm7
        vpinsrq $1, (%rax,%r14), %xmm7, %xmm1
        vinserti128     $0x1, %xmm4, %ymm9, %ymm9

from

  _1507 = (void *) ivtmp.760_1462;
  _792 = MEM[base: _1507, offset: 0B];
  _1508 = (void *) ivtmp.760_1462;
  _794 = MEM[base: _1508, index: _331, offset: 0B];
  _1509 = (void *) ivtmp.760_1462;
  _796 = MEM[base: _1509, index: _331, step: 2, offset: 0B];
  _1510 = (void *) ivtmp.760_1462;
  _1511 = _331 * 3;
  _798 = MEM[base: _1510, index: _1511, offset: 0B];
  _1512 = (void *) ivtmp.760_1462;
  _800 = MEM[base: _1512, index: _331, step: 4, offset: 0B];
  _1513 = (void *) ivtmp.760_1462;
  _1514 = _331 * 5;
  _802 = MEM[base: _1513, index: _1514, offset: 0B];
  _1515 = (void *) ivtmp.760_1462;
  _1516 = _331 * 6;
  _804 = MEM[base: _1515, index: _1516, offset: 0B];
  _1517 = (void *) ivtmp.760_1462;
  _1518 = _331 * 7;
  _806 = MEM[base: _1517, index: _1518, offset: 0B];
  vect_cst__808 = {_792, _794, _796, _798, _800, _802, _804, _806};

where IVOPTs did a reasonable job. Later LIM hoists all the invariant
_331 * N indexes. And IVOPTs failed to realize that _331 * 3 can be used for
_331 * 6 by using step == 2. But in the end the register-optimal decision is
probably to strength-reduce this (the vectorizer generates strength-reduced
code). We do end up spilling most IVs in this loop.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #12 from Richard Biener ---
I have opened PR84102 for the missed optimizations in this particular loop. I
believe now the interesting one is the other.

  30.25%  a.out  a.out      [.] __solv_cap_MOD_fourir2d
  24.83%  a.out  a.out      [.] __solv_cap_MOD_fourir
  24.83%  a.out  a.out      [.] __solv_cap_MOD_fourir2dx
  18.78%  a.out  [unknown]  [k] 0x813366e7

and the 551 loops are in the innermost nest of fourir() which is
called/inlined into fourir*. It's actually not array expressions but the
different inline copies we vectorize. Cost model for that one:

capacita2.f90:551:0: note: Cost model analysis:
  Vector inside of loop cost: 3756
  Vector prologue cost: 64
  Vector epilogue cost: 1712
  Scalar iteration cost: 516
  Scalar outside cost: 4
  Vector outside cost: 1776
  prologue iterations: 0
  epilogue iterations: 4
  Calculated minimum iters for profitability: 0
capacita2.f90:551:0: note: Runtime profitability threshold = 8
capacita2.f90:551:0: note: Static estimate profitability threshold = 11545608

the loop is hybrid SLP, VF is 8

capacita2.f90:551:0: note: improved number of alias checks from 36 to 6

so here it's also dependence analysis breaking down because the step is
unknown:

(compute_affine_dependence
  stmt_a: _170 = REALPART_EXPR <*a.0_107[_54]>;
  stmt_b: _196 = REALPART_EXPR <*a.0_107[_73]>;
(analyze_overlapping_iterations
  (chrec_a = 0)
  (chrec_b = 0)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
(analyze_overlapping_iterations
  (chrec_a = {((integer(kind=8)) j0_119 + 1) * iftmp.476_91, +, iftmp.476_91}_3)
  (chrec_b = {((integer(kind=8)) j1_124 + 1) * iftmp.476_91, +, iftmp.476_91}_3)
(analyze_siv_subscript
siv test failed: unimplemented
  (overlap_iterations_a = not known)
  (overlap_iterations_b = not known))
)
) -> dependence analysis failed

The only thing we know about iftmp.476_91 is that it isn't zero (but I'm sure
dependence analysis doesn't use that fact). I think this specific dependence
should be computable, by just dividing both chrecs by iftmp.476_91?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #11 from Richard Biener ---
So probably the big slowdown is because the vectorized loop body is so much
larger. Unvectorized:

.L61:
        vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm0
        incl    %ecx
        vmulss  (%rdx), %xmm0, %xmm0
        vmovss  %xmm0, (%rdx)
        addq    %rax, %rdx
        cmpl    %r12d, %ecx
        jne     .L61

vectorized (see how we hoist the load), with -march=haswell:

        vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm5
        movq    144(%rsp), %rdi
        leaq    (%rbx,%r13), %rdx
        xorl    %r10d, %r10d
        movq    %rdx, %rsi
        leaq    0(%r13,%rdi), %rcx
        movq    %rcx, %rdi
        vbroadcastss    %xmm5, %ymm5
        .p2align 4,,10
        .p2align 3
.L58:
        vmovss  (%rcx,%rax,2), %xmm1
        vmovss  (%rsi,%rax,2), %xmm0
        incl    %r10d
        vinsertps       $0x10, (%rcx,%r8), %xmm1, %xmm3
        vinsertps       $0x10, (%rsi,%r8), %xmm0, %xmm7
        vmovss  (%rcx), %xmm1
        vmovss  (%rsi), %xmm0
        vinsertps       $0x10, (%rcx,%rax), %xmm1, %xmm1
        vinsertps       $0x10, (%rsi,%rax), %xmm0, %xmm0
        addq    %r9, %rcx
        addq    %r9, %rsi
        vmovlhps        %xmm7, %xmm0, %xmm0
        vmovlhps        %xmm3, %xmm1, %xmm1

^^^ not sure why we construct in such a strange way - ICC simply does a single
vmovss and then 7 vinsertps

        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        vmulps  %ymm5, %ymm0, %ymm0
        vmovss  %xmm0, (%rdx)
        vextractps      $1, %xmm0, (%rdx,%rax)
        vextractps      $2, %xmm0, (%rdx,%rax,2)
        vextractps      $3, %xmm0, (%rdx,%r8)
        vextractf128    $0x1, %ymm0, %xmm0
        addq    %r9, %rdx
        vmovss  %xmm0, (%rdi)
        vextractps      $1, %xmm0, (%rdi,%rax)
        vextractps      $2, %xmm0, (%rdi,%rax,2)
        vextractps      $3, %xmm0, (%rdi,%r8)

Similar here. But fixing this would only reduce this by a few stmts.

        addq    %r9, %rdi
        cmpl    %r10d, %r14d
        jne     .L58

Anyway, this size (140 bytes, 9 cache lines) probably blows any loop stream
cache limits (IIRC that was around 3 cache lines). Compared to 26 bytes for
the scalar version (2 cache lines). Any such considerations would be best
placed in the target's finish_cost hook where the target knows all stmts that
are going to be emitted and can in theory also cost against the scalar variant
(not easily available).

The SSE variant is smaller so measuring the slowdown with SSE only would be
interesting. Hmm, the SSE variant is slower (for all of capacita) but
-fno-tree-vectorize is fastest.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #10 from Richard Biener --- So strided stores are costed as

      /* Costs of the stores.  */
      if (memory_access_type == VMAT_ELEMENTWISE
          || memory_access_type == VMAT_GATHER_SCATTER)
        {
          /* N scalar stores plus extracting the elements.  */
          unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
          inside_cost += record_stmt_cost (body_cost_vec,
                                           ncopies * assumed_nunits,
                                           scalar_store, stmt_info, 0,
                                           vect_body);
        }
...
      if (memory_access_type == VMAT_ELEMENTWISE
          || memory_access_type == VMAT_STRIDED_SLP)
        {
          /* N scalar stores plus extracting the elements.  */
          unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
          inside_cost += record_stmt_cost (body_cost_vec,
                                           ncopies * assumed_nunits,
                                           vec_to_scalar, stmt_info, 0,
                                           vect_body);
        }

There's the issue of "overloading" vec_to_scalar with extraction. It's costed as generic sse_op, which IMHO is reasonable here (vextract*). The scalar cost is 12 for each of the following stmts:

  _66 = *_150[_65];
  d1.76_67 = d1;
  _160 = d1.76_67 * _73;
  _74 = _66 * _160;
  *_150[_65] = _74;

The vector variant is adding the construction/extraction cost compared to the scalar variant and wins with the two multiplications being costed once instead of four times. We don't actually factor in the "win" of hoisting the vectorized load of 'd1' only in the vector case. With AVX2 things become even more "cheap" vectorized. And we of course peel the epilogue completely.

Ideally we'd interchange this specific loop, but interchange doesn't do anything here because we get niters that might be zero. Later, dependences would probably wreck things, but here this also is a missed optimization. We have two paths running into the loop loading ng1 and checking it against zero properly, but the PHI result doesn't have this range info merged (well, VRP sets the info, but it needs LIM / PRE to see the opportunity, so it's only set by late VRP).
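As a sanity check of the figures quoted here, a small illustrative sketch of the scalar side of the comparison (the per-stmt cost of 12 is taken from the comment; the bookkeeping around it is mine):

```python
# Scalar costing of the loop body: five GIMPLE stmts, each costed 12.
# The stmt list mirrors the dump above; the value 12 per stmt is quoted
# in the comment, the rest is illustrative.
SCALAR_STMT_COST = 12
stmts = [
    "_66 = *_150[_65]",
    "d1.76_67 = d1",
    "_160 = d1.76_67 * _73",
    "_74 = _66 * _160",
    "*_150[_65] = _74",
]
scalar_iter_cost = SCALAR_STMT_COST * len(stmts)
print(scalar_iter_cost)  # 60, matching "Scalar iteration cost: 60" in the dump
```

This reproduces the "Scalar iteration cost: 60" figure in the cost model dump quoted in comment #9.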
[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #9 from Richard Biener --- (In reply to Martin Liška from comment #7) > (In reply to Jakub Jelinek from comment #6) > > Is it really r256643 and not r256644 that is causing this though? > > Yes, I can verify that it's r256644 that's causing the regression. This means that those newly vectorized loops make capacita slower. 551 is

       do m=1,n/4-1
          h = A(j0+m) + A(j2+m)*E(m*inc)
          A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc)
          A(j0+m) = h
          eh = conjg(E(ntot/4-m*inc))
          h = A(j1+m) - A(j3+m)*eh
          A(j3+m) = A(j1+m) + A(j3+m)*eh
          A(j1+m) = h
       end do

but it's actually the loops from the array expressions, I guess (receiving "interesting" locations). 105 is

    do i=1,Ng1              ! .. and multiply charge with x
->    do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do

The variable strides are there because those arrays are accessed via array descriptors. This means that the stride will very likely be one, so any "strided XY" vectorization will have quite a big overhead. Of course in the end it looks like a cost model issue (which might be just not factoring in the alias runtime check enough). In the 105 case I expect the non-vectorized loop to be a quite "nice" optimized nest, with IVOPTs doing a good job, etc. If the inner loop is vectorized conditionally, this can wreck code generation and runtime quite a bit. Ng1/Ng2 are 1024 (at runtime, read from capacita.in). For 105 we have

capacita.f90:105:0: note: need run-time check that (ssizetype) ((sizetype) prephitmp_341 * 4) is nonzero
capacita.f90:105:0: note: versioning for alias required: can't determine dependence between d1 and *_150[_65]

so two checks are needed.
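What the vectorizer is up against in loop 105 can be shown with a hypothetical Python model (mine, not GCC code): the array sits behind a descriptor whose stride is only known at run time, and with the inner loop over j the accesses to column-major X(i,j) step by Ng1 elements, which is exactly the "strided" pattern being costed.

```python
def scale_with_x(x, ng1, ng2, d1):
    # Hypothetical model of loop 105.  X(i,j) in column-major (Fortran)
    # order sits at flat index (j-1)*ng1 + (i-1); with the inner loop
    # over j (as in the source) successive accesses stride by ng1
    # elements -- the strided loads/stores the vectorizer must cost.
    for i in range(1, ng1 + 1):
        for j in range(1, ng2 + 1):
            x[(j - 1) * ng1 + (i - 1)] *= d1 * (i - (ng1 + 1) / 2.0)
    return x

# Tiny 2x2 example with D1 = 2.0: row i=1 is scaled by -1.0, row i=2 by +1.0
print(scale_with_x([1.0] * 4, 2, 2, 2.0))  # [-1.0, 1.0, -1.0, 1.0]
```

At run time the descriptor stride is almost always one, so the scalar nest with unit-stride inner accesses (after interchange or IVOPTs) is hard to beat with element-wise vector construction.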
capacita.f90:105:0: note: Cost model analysis:
  Vector inside of loop cost: 172
  Vector prologue cost: 60
  Vector epilogue cost: 136
  Scalar iteration cost: 60
  Scalar outside cost: 8
  Vector outside cost: 196
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

I think the bug is that we're somehow thinking the vectorized arithmetic (two multiplications) offsets the use of strided loads and stores...

Testcase for this loop:

module solv_cap
  implicit none

  public  :: solveP
  integer, parameter, public :: dp = selected_real_kind(5)
  real(kind=dp), private :: Pi, eps0
  real(kind=dp), private :: D1, D2
  integer, private, save :: Ng1=0, Ng2=0
  integer, private, pointer, dimension(:,:) :: Grid

contains

  subroutine solveP(P)
    real(kind=dp), intent(out) :: P
    real(kind=dp), allocatable, dimension(:,:) :: Y0, X
    integer :: i,j

    allocate( Y0(Ng1,Ng2), X(Ng1,Ng2) )
    do i=1,Ng1
      do j=1,Ng2
        Y0(i,j) = D1 * (i-(Ng1+1)/2.0_dp) * Grid(i,j)
      end do
    end do
    ! RHS for in-field E_x=1. V = -V_in = x, where metal on grid
    call solve( X, Y0 )
    X = X - sum(X)/size(X)  ! get rid of monopole term ..
    do i=1,Ng1              ! .. and multiply charge with x
      do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do
    P = sum(X)*D1*D2 * 4*Pi*eps0   ! E-dipole moment in 1 V/m field
    deallocate( X, Y0 )
    return
  end subroutine solveP

end module solv_cap
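The threshold of 7 in the dump above can be reproduced from the vectorizer's profitability condition, SIC*niters + SOC > VIC*(niters - NPEEL)/VF + VOC (paraphrased from the comment in GCC's vect_estimate_min_profitable_iters). Two assumptions on my part: a vectorization factor of 4, which is consistent with the dumped 2 epilogue iterations (VF/2 when the iteration count is unknown), and the exact form of the inequality.

```python
import math

# Figures from the cost model dump above; VF = 4 is an assumption.
VIC = 172      # vector inside-of-loop cost
VOC = 196      # vector outside cost (prologue 60 + epilogue 136)
SIC = 60       # scalar iteration cost
SOC = 8        # scalar outside cost
NPEEL = 0 + 2  # prologue + epilogue iterations
VF = 4

# Profitability: SIC*n + SOC > VIC*(n - NPEEL)/VF + VOC
# Solved for n:   n > (VF*(VOC - SOC) - VIC*NPEEL) / (VF*SIC - VIC)
threshold = (VF * (VOC - SOC) - VIC * NPEEL) / (VF * SIC - VIC)
min_iters = math.floor(threshold) + 1  # strict inequality
print(threshold, min_iters)  # 6.0 7
```

The break-even point lands exactly at 6 iterations, so the smallest profitable iteration count is 7, matching "Calculated minimum iters for profitability: 7". With only 68 cost units of scalar advantage per 4 iterations, small mis-costings of the strided loads/stores swing this decision easily.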
[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 --- Comment #8 from Jan Hubicka --- https://gcc.opensuse.org/gcc-old/c++bench-czerny/pb11/pb11-summary.txt-2-0.html runs with -Ofast -funroll-loops, so that indeed does not seem essential to trigger the regression (it may be two different regressions, of course). Note that tfft, aermod and tfft2 also show regressions that may be easy to track/fix.
[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037 Martin Liška changed:

           What    |Removed                     |Added
----------------------------------------------------------------
            Summary|[8 Regression] Speed        |[8 Regression] Speed
                   |regression of polyhedron    |regression of polyhedron
                   |benchmark since r256643     |benchmark since r256644

--- Comment #7 from Martin Liška --- (In reply to Jakub Jelinek from comment #6) > Is it really r256643 and not r256644 that is causing this though? Yes, I can verify that it's r256644 that's causing the regression.