[Bug tree-optimization/89049] [8/9 Regression] Unexpected vectorization

2019-03-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #11 from Richard Biener  ---
Just an update on costs:

t.c:1:35: note:   === vect_compute_single_scalar_iteration_cost ===
0x483e120 *_3 1 times scalar_load costs 12 in body
0x483e120 _4 + r_16 1 times scalar_stmt costs 12 in body

and the vector body cost:

0x492f9d0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x492f9d0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x492f9d0 _4 + r_16 8 times scalar_stmt costs 96 in body

That results in the overall (and sensible)

t.c:1:35: note:  Cost model analysis:
  Vector inside of loop cost: 148
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0

where one vector iteration stands in for 8 scalar iterations, thus 24 * 8 = 192.

As mentioned elsewhere, the vectorizer cost model does not care about
pipeline latency or dependency issues, nor about competition for execution
resources.  It also does not care about loop size (the vector loop has one
stmt more than the unrolled scalar loop, for example).  I once played with
limiting vectorized-loop growth via the unroll parameters, but we're far
from hitting those limits here.

Btw, a microbenchmark shows the vectorized loop (with -mavx2) executes in
about the same time as the scalar, not-unrolled one.  When the scalar loop
is unrolled 8 times the runtime is the same again (this is all benchmarked
on a Haswell machine).  If you disregard noise, the scalar unrolled loop is
maybe a tad faster than the other cases.

I believe the limiting factor is the dependence chain of the adds;
there are plenty of parallel execution resources to cope with the
ugliness and friends.

This leaves the code bloat as the regression, I think.


2019-02-22 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|8.3 |8.4

--- Comment #10 from Jakub Jelinek  ---
GCC 8.3 has been released.


2019-01-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #9 from Richard Biener  ---
Split out the target cost issue to PR89114; it'll improve code-gen for the
unwanted vectorization a bit, at least.  That's independent of the
vectorizer cost issue.


2019-01-25 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

Segher Boessenkool  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #8 from Segher Boessenkool  ---
(In reply to Richard Biener from comment #5)
> So combine can see

[ snip, 11 ]

> with its uses

[ snip, 13 and 25 ]

> but somehow it only tries 11 -> 13:

combine only tries to combine something with its first use.  Trying second
(or third, etc.) uses as well could easily take exponential time.

I do however want combine to try to combine an insn together with its first two
uses.  That is just as linear as even simple 1+1 combinations, and it is likely
to succeed (in fact there is at least one other PR where I wanted this).


2019-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #7 from Richard Biener  ---
Author: rguenth
Date: Fri Jan 25 12:46:24 2019
New Revision: 268264

URL: https://gcc.gnu.org/viewcvs?rev=268264&root=gcc&view=rev
Log:
2019-01-25  Richard Biener  

PR tree-optimization/89049
* tree-vect-loop.c (vect_compute_single_scalar_iteration_cost):
Look at the pattern stmt to determine if the stmt is vectorized.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-loop.c


2019-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

Richard Biener  changed:

   What|Removed |Added

 CC|segher at gcc dot gnu.org  |

--- Comment #6 from Richard Biener  ---
From a quick look, rtx_cost should end up recursing into the MEM.  Oh:
targetm.rtx_costs is expected to handle sub-costs, but does

    case VEC_SELECT:
    case VEC_CONCAT:
    case VEC_DUPLICATE:
      /* ??? Assume all of these vector manipulation patterns are
         recognizable.  In which case they all pretty much have the
         same cost.  */
      *total = cost->sse_op;
      return true;

which is of course bogus for patterns involving MEMs (a new possibility
with AVX).  Not combine's issue.


2019-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

Richard Biener  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
So combine can see

(insn 11 10 13 3 (set (reg:V8SF 105)
(vec_concat:V8SF (reg:V4SF 106 [ MEM[base: _2, offset: 0B] ])
(mem:V4SF (plus:DI (reg:DI 85 [ ivtmp.11 ])
(const_int 16 [0x10])) [1 MEM[base: _2, offset: 0B]+16 S16
A32]))) "t.c":1:72 5046 {avx_vec_concatv8sf}
 (nil))

with its uses

(insn 13 11 14 3 (set (reg:V4SF 107)
(vec_select:V4SF (reg:V8SF 105)
(parallel [
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 3 [0x3])
]))) 2702 {vec_extract_lo_v8sf}
 (nil))


(insn 25 24 26 3 (set (reg:V4SF 111)
(vec_select:V4SF (reg:V8SF 105)
(parallel [
(const_int 4 [0x4])
(const_int 5 [0x5])
(const_int 6 [0x6])
(const_int 7 [0x7])
]))) 2711 {vec_extract_hi_v8sf}
 (expr_list:REG_DEAD (reg:V8SF 105)
(nil)))

but somehow it only tries 11 -> 13:

Trying 11 -> 13:
   11: r105:V8SF=vec_concat(r106:V4SF,[r85:DI+0x10])
  REG_DEAD r106:V4SF
   13: r107:V4SF=vec_select(r105:V8SF,parallel)
...
Successfully matched this instruction:
(set (reg:V8SF 105)
(vec_concat:V8SF (reg:V4SF 106 [ MEM[base: _2, offset: 0B] ])
(mem:V4SF (plus:DI (reg:DI 85 [ ivtmp.11 ])
(const_int 16 [0x10])) [1 MEM[base: _2, offset: 0B]+16 S16
A32])))
Successfully matched this instruction:
(set (reg:V4SF 107)
(reg:V4SF 106 [ MEM[base: _2, offset: 0B] ]))
allowing combination of insns 11 and 13
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2    11: r105:V8SF=vec_concat(r106:V4SF,[r85:DI+0x10])
deferring rescan insn with uid = 11.
modifying insn i3    13: r107:V4SF=r106:V4SF
  REG_DEAD r106:V4SF

then it continues:

Trying 11 -> 25:
   11: r105:V8SF=vec_concat(r106:V4SF,[r85:DI+0x10])
   25: r111:V4SF=vec_select(r105:V8SF,parallel)
  REG_DEAD r105:V8SF
Successfully matched this instruction:
(set (reg:V4SF 111)
(mem:V4SF (plus:DI (reg:DI 85 [ ivtmp.11 ])
(const_int 16 [0x10])) [1 MEM[base: _2, offset: 0B]+16 S16 A32]))
rejecting combination of insns 11 and 25
original costs 4 + 4 = 8
replacement cost 12

where it rejects this for some reason...  I think the cost of 4
assigned to 11 is bogus here (maybe combine uses wrong costs, not
accounting for embedded MEMs?)


2019-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #4 from Richard Biener  ---
With -mtune=core-avx2 we do

vmovups (%rdi), %xmm1
vmovups (%rdi), %ymm3
...
vextractf128 $0x1, %ymm3, %xmm1

with -mtune=intel the even more weird

vmovups (%rdi), %xmm1
addq $32, %rdi
vmovups -32(%rdi), %ymm3
...
vextractf128 $0x1, %ymm3, %xmm1

I guess at runtime the vectorized variant isn't that much worse, except
for the loop size growth.  So an additional "weight" we could put into
the generic vectorizer cost metric would be the number of stmts
generated; that is, computing an effective unroll factor and applying
the unroll limits to that.  In this case we'd do 8-times unrolling
(the resulting loop body is twice as large as 8-times-unrolled scalar code).


2019-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #3 from Richard Biener  ---
In the assembly I notice

vinsertf128 $0x1, 16(%rdi), %ymm4, %ymm2
...
vextractf128 $0x1, %ymm2, %xmm1

somehow we fail to elide the initial %ymm2 build with the upper half
extraction being the only use...  possibly because it has a memory
operand?


2019-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #2 from Richard Biener  ---
Created attachment 45531
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45531&action=edit
scalar loop cost patch

I'm testing this patch (not fixing the testcase, just improving costs).


2019-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P2
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2019-01-25
 CC||hubicka at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
scalar costs for single iteration:

0x322a040 _1 * 4 1 times scalar_stmt costs 12 in body
0x322a040 *_3 1 times scalar_load costs 12 in body
0x322a040 _4 + r_16 1 times scalar_stmt costs 12 in body

single-iteration vector cost:

0x31651e0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x31651e0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x31651e0 _4 + r_16 8 times scalar_stmt costs 96 in body

there's the old issue that we use vec_to_scalar (originally meant to be
used for the vector to scalar conversion in the reduction epilogue only,
thus "free" on x86_64 since you can simply use %xmm0 for element zero)
also for random element extraction.

Besides this it's the usual issue that even if everything else is scalar
the apparent savings by vectorizing the load (12 * 8 scalar vs. 20 vector)
offsets quite a bit of eventual extra mangling (here the 8 vec_to_scalar
operations).  Making vec_to_scalar cost the same as a scalar load would
offset those.  But then this makes the (few, in epilogue only) real
"free" vec_to_scalar ops expensive.

So

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c  (revision 268257)
+++ gcc/config/i386/i386.c  (working copy)
@@ -45806,6 +45806,7 @@ ix86_builtin_vectorization_cost (enum ve
   case scalar_stmt:
 return fp ? ix86_cost->addss : COSTS_N_INSNS (1);

+  case vec_to_scalar:
   case scalar_load:
/* load/store costs are relative to register move which is 2. Recompute
   it to COSTS_N_INSNS so everything have same base.  */
@@ -45834,7 +45835,6 @@ ix86_builtin_vectorization_cost (enum ve
  index = 2;
 return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;

-  case vec_to_scalar:
   case scalar_to_vec:
 return ix86_vec_cost (mode, ix86_cost->sse_op);

but as said, this is a hack in the target (it needs to be benchmarked if
it is to be considered).  The real issue is that we use both vec_to_scalar
and scalar_to_vec for different things that usually do not have
even similar costs.

Note that even with the above we vectorize the loop because in the
scalar costing we cost the address-generation for the scalar load
but not in the vector case (another discrepancy...).  This happens
because we detected a pattern involving this...

t.c:1:35: note:   vect_recog_widen_mult_pattern: detected: _2 = _1 * 4;
t.c:1:35: note:   widen_mult pattern recognized: patt_7 = (long unsigned int)
patt_13;

I have a fix for that (testing separately).  With this fix alone we
still vectorize as well.


2019-01-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

Jakub Jelinek  changed:

   What|Removed |Added

 Target||x86_64-linux
 CC||hjl.tools at gmail dot com,
   ||rsandifo at gcc dot gnu.org,
   ||uros at gcc dot gnu.org
   Target Milestone|--- |8.3