https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160
--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 13 Oct 2022, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160
>
> --- Comment #9 from Kewen Lin <linkw at gcc dot gnu.org> ---
> >
> > The above doesn't look wrong (but may miss the rest of the IL). On
> > x86_64 this looks like
> >
> >   <bb 4> [local count: 105119324]:
> >   # sum0_41 = PHI <sum0_28(3)>
> >   # sum1_39 = PHI <sum1_29(3)>
> >   # sum2_37 = PHI <sum2_30(3)>
> >   # sum3_35 = PHI <sum3_31(3)>
> >   # vect_sum3_31.11_59 = PHI <vect_sum3_31.11_60(3)>
> >   _58 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 0>;
> >   _57 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 32>;
> >   _56 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 64>;
> >   _55 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 96>;
> >   _74 = _58 + _57;
> >   _76 = _56 + _74;
> >   _78 = _55 + _76;
> >
> >   <bb 5> [local count: 118111600]:
> >   # prephitmp_79 = PHI <_78(4), 0.0(2)>
> >   return prephitmp_79;
> >
>
> Yeah, it looks expected without unrolling.
>
> > when unrolling is applied, thus with a larger VF, you should ideally
> > see the vectors accumulated.
> >
> > Btw, I've fixed a SLP reduction issue two days ago in
> > r13-3226-gee467644c53ee2
> > though that looks unrelated?
>
> Thanks for the information, I'll double check it.
>
> >
> > When I force a larger VF on x86 by adding a int store in the loop I see
> >
> >   <bb 11> [local count: 94607391]:
> >   # sum0_48 = PHI <sum0_29(3)>
> >   # sum1_36 = PHI <sum1_30(3)>
> >   # sum2_35 = PHI <sum2_31(3)>
> >   # sum3_24 = PHI <sum3_32(3)>
> >   # vect_sum3_32.16_110 = PHI <vect_sum3_32.16_106(3)>
> >   # vect_sum3_32.16_111 = PHI <vect_sum3_32.16_107(3)>
> >   # vect_sum3_32.16_112 = PHI <vect_sum3_32.16_108(3)>
> >   # vect_sum3_32.16_113 = PHI <vect_sum3_32.16_109(3)>
> >   _114 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 0>;
> >   _115 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 32>;
> >   _116 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 64>;
> >   _117 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 96>;
> >   _118 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 0>;
> >   _119 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 32>;
> >   _120 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 64>;
> >   _121 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 96>;
> >   _122 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 0>;
> >   _123 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 32>;
> >   _124 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 64>;
> >   _125 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 96>;
> >   _126 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 0>;
> >   _127 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 32>;
> >   _128 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 64>;
> >   _129 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 96>;
> >   _130 = _114 + _118;
> >   _131 = _115 + _119;
> >   _132 = _116 + _120;
> >   _133 = _117 + _121;
> >   _134 = _130 + _122;
> >   _135 = _131 + _123;
> >   _136 = _132 + _124;
> >   _137 = _133 + _125;
> >   _138 = _134 + _126;
> >
> > see how the lanes from the different vectors are accumulated? (yeah,
> > we should simply add the vectors!)
>
> Yes, it's the same as what I saw on ppc64le, but the closely following dce6
> removes the three vect_sum3_32 (in your dump, they are
> vect_sum3_32.16_10{7,8,9}) as the subsequent joints don't actually use the
> separated accumulated lane values (_138 -> sum0 ...) but only use
> vect_sum3_32.16_110.

I do - the epilog is even vectorized and it works fine at runtime.
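
For readers following the dumps, here is a minimal C sketch of the two epilogue shapes discussed above. It is an illustration only, not GCC code: the array model, the function names and the constants LANES/NVECS are assumptions made for the example. It models four 4-lane float accumulators and compares the lane-wise accumulation seen in the dump (_130 ... _138) with "simply adding the vectors" first. A third, hypothetical variant reads only the first accumulator, which is the situation the quoted ppc64le report describes.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumed: 4 x 32-bit lanes, matching the BIT_FIELD_REF offsets */
#define NVECS 4 /* assumed: four vector accumulators, as in the second dump */

/* Lane-wise epilogue: extract each lane of every accumulator and add the
   scalars, in the same order as _130 = _114 + _118, _134 = _130 + _122,
   _138 = _134 + _126 in the dump.  */
void reduce_lanewise(float acc[NVECS][LANES], float out[LANES])
{
    assert(acc != NULL && out != NULL);
    for (size_t lane = 0; lane < LANES; lane++) {
        float s = acc[0][lane] + acc[1][lane];
        s = s + acc[2][lane];
        s = s + acc[3][lane];
        out[lane] = s;
    }
}

/* "Simply add the vectors": element-wise adds of whole accumulators first,
   then one extraction per lane.  */
void reduce_vector_first(float acc[NVECS][LANES], float out[LANES])
{
    float v[LANES];
    assert(acc != NULL && out != NULL);
    for (size_t lane = 0; lane < LANES; lane++)
        v[lane] = acc[0][lane];
    for (size_t k = 1; k < NVECS; k++)
        for (size_t lane = 0; lane < LANES; lane++)
            v[lane] += acc[k][lane];
    for (size_t lane = 0; lane < LANES; lane++)
        out[lane] = v[lane];
}

/* Hypothetical faulty epilogue: only the first accumulator is read, so the
   contributions of the other three are lost.  */
void reduce_first_only(float acc[NVECS][LANES], float out[LANES])
{
    assert(acc != NULL && out != NULL);
    for (size_t lane = 0; lane < LANES; lane++)
        out[lane] = acc[0][lane];
}
```

Both complete variants add the per-lane values in the same order, so in this model they should produce identical floats; only the point at which lanes are extracted differs.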