https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                      |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2017-05-24
                 CC|                             |uros at gcc dot gnu.org
             Blocks|                             |53947
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                            |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the vectorizer uses "whole vector shift" to do the final reduction:

  vect_sum_11.8_5 = VEC_PERM_EXPR <vect_sum_11.6_6, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
  vect_sum_11.8_19 = VEC_PERM_EXPR <vect_sum_11.8_20, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
  vect_sum_11.8_13 = VEC_PERM_EXPR <vect_sum_11.8_18, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
  stmp_sum_11.7_27 = BIT_FIELD_REF <vect_sum_11.8_26, 32, 0>;

I can see that this is bad for Zen (and eventually bad for avx256 in general, because it crosses lanes). That is, it was supposed to end up using pslldq, not the vperm + palign combos.

That said, the vectorizer could "easily" demote this to first add the two halves and then continue with the reduction scheme. The GIMPLE representation of this is BIT_FIELD_REFs, which I hope would end up being expanded in a way the x86 backend can handle (hi/lo subregs?).

I'll see to handle this better in the vectorizer.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations