https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                      |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2017-05-24
                 CC|                             |uros at gcc dot gnu.org
             Blocks|                             |53947
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                            |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the vectorizer uses "whole vector shift" to do the final reduction:

  vect_sum_11.8_5 = VEC_PERM_EXPR <vect_sum_11.6_6, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
  vect_sum_11.8_19 = VEC_PERM_EXPR <vect_sum_11.8_20, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
  vect_sum_11.8_13 = VEC_PERM_EXPR <vect_sum_11.8_18, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
  stmp_sum_11.7_27 = BIT_FIELD_REF <vect_sum_11.8_26, 32, 0>;

I can see that this is bad for Zen (and eventually bad for avx256 in general, because it crosses lanes). That is, it was supposed to end up using pslldq, not the vperm + palign combos.

That said, the vectorizer could "easily" demote this to first add the two halves and then continue with the reduction scheme. The GIMPLE representation of this is BIT_FIELD_REFs, which I hope would end up being expanded in a way the x86 backend can handle (hi/lo subregs?).

I'll see to handle this better in the vectorizer.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations