[Bug tree-optimization/89049] [8/9 Regression] Unexpected vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #11 from Richard Biener ---
Just an update on costs.  The scalar iteration cost:

t.c:1:35: note: === vect_compute_single_scalar_iteration_cost ===
0x483e120 *_3 1 times scalar_load costs 12 in body
0x483e120 _4 + r_16 1 times scalar_stmt costs 12 in body

and the vector body cost:

0x492f9d0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x492f9d0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x492f9d0 _4 + r_16 8 times scalar_stmt costs 96 in body

That results in the overall (and sensible)

t.c:1:35: note: Cost model analysis:
  Vector inside of loop cost: 148
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0

where one vector iteration covers 8 scalar iterations, thus 24 * 8 = 192.

As mentioned elsewhere, the vectorizer cost model does not account for
pipeline latency, dependency issues, or competition for execution resources.
It also does not account for loop size (the vector loop has one stmt more
than the unrolled scalar loop, for example).  I once played with limiting
the vectorization loop growth via the unroll parameters, but we're far from
hitting those limits here.

Btw, a microbenchmark shows the loop executes in about the same time
vectorized with -mavx2 as scalar and not unrolled.  When the scalar loop is
unrolled 8 times the runtime is the same again (this is all benchmarked on a
Haswell machine).  If you disregard noise, the scalar unrolled loop is maybe
a tad bit faster than the other cases.  I believe the limiting factor is the
dependence chain of the adds; there are plenty of parallel execution
resources to cope with the rest of the ugliness.  This leaves the code bloat
as the regression, I think.
Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------------------------------------------------
   Target Milestone|8.3     |8.4

--- Comment #10 from Jakub Jelinek ---
GCC 8.3 has been released.
--- Comment #9 from Richard Biener ---
Split out the target cost issue to PR89114; it'll improve code-gen for the
unwanted vectorization a bit at least.  That's independent of the vectorizer
cost issue.
Segher Boessenkool changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |segher at gcc dot gnu.org

--- Comment #8 from Segher Boessenkool ---
(In reply to Richard Biener from comment #5)
> So combine can see [ snip, 11 ]
> with its uses [ snip, 13 and 25 ]
> but somehow it only tries 11 -> 13:

combine only tries to combine something with its first use.  Trying second
(or third, etc.) uses as well would easily take exponential time.

I do however want combine to try to combine an insn together with its first
two uses.  That is just as linear as even simple 1+1 combinations, and it is
likely to succeed (in fact there is at least one other PR where I wanted
this).
--- Comment #7 from Richard Biener ---
Author: rguenth
Date: Fri Jan 25 12:46:24 2019
New Revision: 268264

URL: https://gcc.gnu.org/viewcvs?rev=268264&root=gcc&view=rev
Log:
2019-01-25  Richard Biener

	PR tree-optimization/89049
	* tree-vect-loop.c (vect_compute_single_scalar_iteration_cost):
	Look at the pattern stmt to determine if the stmt is vectorized.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-loop.c
Richard Biener changed:

           What    |Removed                   |Added
----------------------------------------------------------------------------
                 CC|segher at gcc dot gnu.org |

--- Comment #6 from Richard Biener ---
From a quick look rtx_cost should end up recursing to the MEM.  Oh.
targetm.rtx_costs is expected to handle sub-costs but does

    case VEC_SELECT:
    case VEC_CONCAT:
    case VEC_DUPLICATE:
      /* ??? Assume all of these vector manipulation patterns are
         recognizable.  In which case they all pretty much have the
         same cost.  */
     *total = cost->sse_op;
     return true;

which is of course bogus for patterns involving MEMs (a new possibility with
AVX).  Not combine's issue.
Richard Biener changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |segher at gcc dot gnu.org

--- Comment #5 from Richard Biener ---
So combine can see

(insn 11 10 13 3 (set (reg:V8SF 105)
        (vec_concat:V8SF (reg:V4SF 106 [ MEM[base: _2, offset: 0B] ])
            (mem:V4SF (plus:DI (reg:DI 85 [ ivtmp.11 ])
                    (const_int 16 [0x10])) [1 MEM[base: _2, offset: 0B]+16 S16 A32]))) "t.c":1:72 5046 {avx_vec_concatv8sf}
     (nil))

with its uses

(insn 13 11 14 3 (set (reg:V4SF 107)
        (vec_select:V4SF (reg:V8SF 105)
            (parallel [
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) 2702 {vec_extract_lo_v8sf}
     (nil))
(insn 25 24 26 3 (set (reg:V4SF 111)
        (vec_select:V4SF (reg:V8SF 105)
            (parallel [
                    (const_int 4 [0x4])
                    (const_int 5 [0x5])
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                ]))) 2711 {vec_extract_hi_v8sf}
     (expr_list:REG_DEAD (reg:V8SF 105)
        (nil)))

but somehow it only tries 11 -> 13:

Trying 11 -> 13:
   11: r105:V8SF=vec_concat(r106:V4SF,[r85:DI+0x10])
      REG_DEAD r106:V4SF
   13: r107:V4SF=vec_select(r105:V8SF,parallel)
...
Successfully matched this instruction:
(set (reg:V8SF 105)
    (vec_concat:V8SF (reg:V4SF 106 [ MEM[base: _2, offset: 0B] ])
        (mem:V4SF (plus:DI (reg:DI 85 [ ivtmp.11 ])
                (const_int 16 [0x10])) [1 MEM[base: _2, offset: 0B]+16 S16 A32])))
Successfully matched this instruction:
(set (reg:V4SF 107)
    (reg:V4SF 106 [ MEM[base: _2, offset: 0B] ]))
allowing combination of insns 11 and 13
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2    11: r105:V8SF=vec_concat(r106:V4SF,[r85:DI+0x10])
deferring rescan insn with uid = 11.
modifying insn i3    13: r107:V4SF=r106:V4SF
      REG_DEAD r106:V4SF

then it continues:

Trying 11 -> 25:
   11: r105:V8SF=vec_concat(r106:V4SF,[r85:DI+0x10])
   25: r111:V4SF=vec_select(r105:V8SF,parallel)
      REG_DEAD r105:V8SF
Successfully matched this instruction:
(set (reg:V4SF 111)
    (mem:V4SF (plus:DI (reg:DI 85 [ ivtmp.11 ])
            (const_int 16 [0x10])) [1 MEM[base: _2, offset: 0B]+16 S16 A32]))
rejecting combination of insns 11 and 25
original costs 4 + 4 = 8
replacement cost 12

where it rejects this for some reason...  I think the cost of 4 assigned to
insn 11 is bogus here (maybe combine uses wrong costs, not accounting for
embedded MEMs?)
--- Comment #4 from Richard Biener ---
With -mtune=core-avx2 we do

        vmovups (%rdi), %xmm1
        vmovups (%rdi), %ymm3
...
        vextractf128    $0x1, %ymm3, %xmm1

with -mtune=intel the even more weird

        vmovups (%rdi), %xmm1
        addq    $32, %rdi
        vmovups -32(%rdi), %ymm3
...
        vextractf128    $0x1, %ymm3, %xmm1

I guess at runtime the vectorized variant isn't so much worse, were it not
for the loop size growth.  So an additional "weight" we could put into the
generic vectorizer cost metric would be the number of stmts generated - that
is, computing an effective unroll factor and applying the unroll limits to
that.  In this case we'd do 8-times unrolling (the resulting loop body is
twice as large compared to 8-times-unrolled scalar code).
--- Comment #3 from Richard Biener ---
In the assembly I notice

        vinsertf128     $0x1, 16(%rdi), %ymm4, %ymm2
...
        vextractf128    $0x1, %ymm2, %xmm1

somehow we fail to elide the initial %ymm2 build, with the upper half
extraction being its only use...  possibly because it has a memory operand?
--- Comment #2 from Richard Biener ---
Created attachment 45531
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45531&action=edit
scalar loop cost patch

I'm testing this patch (not fixing the testcase, just improving costs).
Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Priority|P3                            |P2
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2019-01-25
                 CC|                              |hubicka at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #1 from Richard Biener ---
scalar costs for a single iteration:

0x322a040 _1 * 4 1 times scalar_stmt costs 12 in body
0x322a040 *_3 1 times scalar_load costs 12 in body
0x322a040 _4 + r_16 1 times scalar_stmt costs 12 in body

single-iteration vector cost:

0x31651e0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x31651e0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x31651e0 _4 + r_16 8 times scalar_stmt costs 96 in body

There's the old issue that we use vec_to_scalar (originally meant to be used
only for the vector-to-scalar conversion in the reduction epilogue, thus
"free" on x86_64 since you can simply use %xmm0 for element zero) also for
random element extraction.  Besides this, it's the usual issue that even if
everything else stays scalar, the apparent savings from vectorizing the load
(12 * 8 scalar vs. 20 vector) offset quite a bit of the eventual extra
mangling (here the 8 vec_to_scalar operations).  Making vec_to_scalar cost
the same as a scalar load would offset those.  But then this makes the (few,
epilogue-only) really "free" vec_to_scalar ops expensive.  So

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 268257)
+++ gcc/config/i386/i386.c      (working copy)
@@ -45806,6 +45806,7 @@ ix86_builtin_vectorization_cost (enum ve
       case scalar_stmt:
         return fp ? ix86_cost->addss : COSTS_N_INSNS (1);

+      case vec_to_scalar:
       case scalar_load:
         /* load/store costs are relative to register move which is 2.
            Recompute it to COSTS_N_INSNS so everything have same base.  */
@@ -45834,7 +45835,6 @@ ix86_builtin_vectorization_cost (enum ve
           index = 2;
         return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;

-      case vec_to_scalar:
       case scalar_to_vec:
         return ix86_vec_cost (mode, ix86_cost->sse_op);

but as said this is a hack in the target (it needs to be benchmarked if it
is to be considered).  The real issue is that we use both vec_to_scalar and
scalar_to_vec for different things that usually do not have even similar
costs.

Note that even with the above we vectorize the loop, because in the scalar
costing we cost the address generation for the scalar load but not in the
vector case (another discrepancy...).  This happens because we detected a
pattern involving it:

t.c:1:35: note: vect_recog_widen_mult_pattern: detected: _2 = _1 * 4;
t.c:1:35: note: widen_mult pattern recognized: patt_7 = (long unsigned int) patt_13;

I have a fix for that (testing separately).  With this fix alone we still
vectorize as well.
Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Target|        |x86_64-linux
                 CC|        |hjl.tools at gmail dot com,
                   |        |rsandifo at gcc dot gnu.org,
                   |        |uros at gcc dot gnu.org
   Target Milestone|---     |8.3