https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160

--- Comment #9 from Kewen Lin <linkw at gcc dot gnu.org> ---
> 
> The above doesn't look wrong (but may miss the rest of the IL).  On
> x86_64 this looks like
> 
>   <bb 4> [local count: 105119324]:
>   # sum0_41 = PHI <sum0_28(3)>
>   # sum1_39 = PHI <sum1_29(3)>
>   # sum2_37 = PHI <sum2_30(3)>
>   # sum3_35 = PHI <sum3_31(3)>
>   # vect_sum3_31.11_59 = PHI <vect_sum3_31.11_60(3)>
>   _58 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 0>;
>   _57 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 32>;
>   _56 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 64>;
>   _55 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 96>;
>   _74 = _58 + _57;
>   _76 = _56 + _74;
>   _78 = _55 + _76;
> 
>   <bb 5> [local count: 118111600]:
>   # prephitmp_79 = PHI <_78(4), 0.0(2)>
>   return prephitmp_79;
> 

Yeah, it looks expected without unrolling.

> when unrolling is applied, thus with a larger VF, you should ideally
> see the vectors accumulated.
> 
> Btw, I've fixed a SLP reduction issue two days ago in
> r13-3226-gee467644c53ee2
> though that looks unrelated?

Thanks for the information, I'll double check it.

> 
> When I force a larger VF on x86 by adding an int store in the loop I see
> 
>   <bb 11> [local count: 94607391]:
>   # sum0_48 = PHI <sum0_29(3)>
>   # sum1_36 = PHI <sum1_30(3)>
>   # sum2_35 = PHI <sum2_31(3)>
>   # sum3_24 = PHI <sum3_32(3)>
>   # vect_sum3_32.16_110 = PHI <vect_sum3_32.16_106(3)>
>   # vect_sum3_32.16_111 = PHI <vect_sum3_32.16_107(3)>
>   # vect_sum3_32.16_112 = PHI <vect_sum3_32.16_108(3)>
>   # vect_sum3_32.16_113 = PHI <vect_sum3_32.16_109(3)>
>   _114 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 0>;
>   _115 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 32>;
>   _116 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 64>;
>   _117 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 96>;
>   _118 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 0>;
>   _119 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 32>;
>   _120 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 64>;
>   _121 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 96>;
>   _122 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 0>;
>   _123 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 32>;
>   _124 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 64>;
>   _125 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 96>;
>   _126 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 0>;
>   _127 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 32>;
>   _128 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 64>;
>   _129 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 96>;
>   _130 = _114 + _118;
>   _131 = _115 + _119;
>   _132 = _116 + _120;
>   _133 = _117 + _121;
>   _134 = _130 + _122;
>   _135 = _131 + _123;
>   _136 = _132 + _124;
>   _137 = _133 + _125;
>   _138 = _134 + _126;
> 
> see how the lanes from the different vectors are accumulated?  (yeah,
> we should simply add the vectors!)

Yes, it's the same as what I saw on ppc64le, but the dce6 pass that closely
follows removes three of the vect_sum3_32 vectors (in your dump, they are
vect_sum3_32.16_10{7,8,9}), as the subsequent statements don't actually use
the separately accumulated lane values (_138 -> sum0 ...) but only use
vect_sum3_32.16_110.
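
To make the two epilogue shapes discussed above concrete, here is a minimal
sketch in plain C. It is not GCC's implementation: the v4sf struct, the
helper names and the LANES/NVECS constants are hypothetical stand-ins for
the four 128-bit partial-sum vectors in the dump. reduce_lanewise mirrors
the BIT_FIELD_REF extracts followed by scalar adds, while reduce_vectorwise
mirrors the "simply add the vectors" form.

```c
#include <stddef.h>

#define LANES 4 /* four 32-bit lanes per vector, as in the dump */
#define NVECS 4 /* four partial-sum vectors with the larger VF */

/* Hypothetical stand-in for one vector accumulator. */
typedef struct { float lane[LANES]; } v4sf;

/* Lane-wise epilogue: extract each lane of every vector, then
   accumulate matching lanes with scalar adds (_130, _134, _138). */
static void reduce_lanewise(const v4sf acc[NVECS], float out[LANES])
{
    for (size_t l = 0; l < LANES; l++) {
        float s = acc[0].lane[l];
        for (size_t v = 1; v < NVECS; v++)
            s += acc[v].lane[l];
        out[l] = s;
    }
}

/* Element-wise add of two vectors. */
static v4sf vadd(v4sf a, v4sf b)
{
    v4sf r;
    for (size_t l = 0; l < LANES; l++)
        r.lane[l] = a.lane[l] + b.lane[l];
    return r;
}

/* Vector-wise epilogue: add whole vectors first, extract lanes once. */
static void reduce_vectorwise(const v4sf acc[NVECS], float out[LANES])
{
    v4sf s = acc[0];
    for (size_t v = 1; v < NVECS; v++)
        s = vadd(s, acc[v]);
    for (size_t l = 0; l < LANES; l++)
        out[l] = s.lane[l];
}
```

Both functions add the vectors in the same order within each lane, so in
this sketch they produce identical results. The difference is only in how
many extracts and scalar adds the epilogue needs. Either way, all NVECS
accumulators have to feed the result, which is why dropping three of them
changes the computed sums.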
