https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120233

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
           Keywords|                            |needs-bisection

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the difference is that we now BB vectorize foo2 to

  <bb 2> [local count: 1073741824]:
  _1 = *b_10(D);
  _2 = _1 >> 8;
  _17 = {_2, _1};
  _3 = (char) _2;
  _4 = (char) _1;
  _5 = MEM[(short int *)b_10(D) + 2B];
  _6 = _5 >> 8;
  _16 = {_6, _5};
  vect__3.7_18 = VEC_PACK_TRUNC_EXPR <_17, _16>;
  _7 = (char) _6;
  _8 = (char) _5;
  vectp.9_19 = a_11(D);
  MEM <vector(4) char> [(char *)vectp.9_19] = vect__3.7_18;

note: Cost model analysis:
_3 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
_7 1 times scalar_store costs 12 in body
_8 1 times scalar_store costs 12 in body
(char) _2 1 times scalar_stmt costs 4 in body
(char) _1 1 times scalar_stmt costs 4 in body
(char) _6 1 times scalar_stmt costs 4 in body
(char) _5 1 times scalar_stmt costs 4 in body
(char) _2 1 times vec_promote_demote costs 12 in body
node 0xc601b90 1 times vec_construct costs 28 in prologue
<unknown> 1 times vec_construct costs 4 in prologue
_3 1 times unaligned_store (misalign -1) costs 12 in body
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gcc.target/i386/pr108938-3.c:22:8:
note: Cost model analysis for part in loop 0:
  Vector cost: 56
  Scalar cost: 64


But reverting the cited offending revision (which has no effect on
vectorization) doesn't restore good code here.  So the bisection must
be wrong?

There's the old argument that stores are costed too high and end up
dominating the cost-benefit analysis despite having no effect on latency.
There's also the issue that two-lane BB vectorization should have a bad
effect on latency and dependences and might only have positive frontend
effects (if at all), and most of the time no positive backend effect
given the excess execution resources available.

On the 15 branch we vectorize both functions but we only vectorize the
store itself, not the demotion:

  _16 = {_3, _4, _7, _8};
  vectp.10_17 = a_11(D);
  MEM <vector(4) char> [(char *)vectp.10_17] = _16;

on trunk there's the missed optimization to recognize

  _1 = *b_10(D);
  _2 = _1 >> 8;
  _17 = {_2, _1};
  _5 = MEM[(short int *)b_10(D) + 2B]; 
  _6 = _5 >> 8;
  _16 = {_6, _5};
  vect__3.7_18 = VEC_PACK_TRUNC_EXPR <_17, _16>;

to something better.  But I'm not sure it's worth pattern matching this.
