https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68956

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW
                 CC|                            |rguenth at gcc dot gnu.org
           Assignee|rguenth at gcc dot gnu.org  |unassigned at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
t.f:13:0: note: loop vectorized

is the offending vectorization.  We if-convert with masked-loads:

  <bb 7>:
  # i_1 = PHI <1(6), i_37(8)>
  # ij_3 = PHI <ij_2(6), ij_25(8)>
  ij_25 = ij_3 + 1;
  ic_26 = i_1 <= _39;
  _27 = jc_24 & ic_26;
  _54 = &*in1_28(D)[ij_3];
  _ifc__55 = _27;
  _29 = MASK_LOAD (_54, 64B, _ifc__55);
  _56 = &*in2_30(D)[ij_3];
  _31 = MASK_LOAD (_56, 64B, _ifc__55);
  _32 = _29 + _31;
  sum_33 = (real(kind=4)) _32;
  _43 = (real(kind=8)) sum_33;
  prephitmp_41 = _27 ? _43 : 0.0;
  *out_35(D)[ij_3] = prephitmp_41;
  i_37 = i_1 + 1;
  if (i_1 == j_5)
    goto <bb 14>;

but to me there is nothing obviously wrong with .optimized:

  <bb 7>:
  # vect_vec_iv_.20_99 = PHI <{ 1, 2, 3, 4, 5, 6, 7, 8 }(6), vect_vec_iv_.20_100(7)>
  # ivtmp.51_57 = PHI <0(6), ivtmp.51_15(7)>
  # ivtmp.52_18 = PHI <ivtmp.52_13(6), ivtmp.52_42(7)>
  # ivtmp.55_45 = PHI <ivtmp.55_50(6), ivtmp.55_16(7)>
  # ivtmp.57_51 = PHI <ivtmp.57_34(6), ivtmp.57_52(7)>
  vectp.30_122 = (vector(4) real(kind=8) *) ivtmp.55_45;
  vectp.26_112 = (vector(4) real(kind=8) *) ivtmp.52_18;
  vect_vec_iv_.20_100 = vect_vec_iv_.20_99 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  mask_ic_26.21_102 = vect_vec_iv_.20_99 <= vect_cst__101;
  mask__27.22_105 = mask_ic_26.21_102 & vect_cst__104;
  mask_patt_58.24_107 = [vec_unpack_lo_expr] mask__27.22_105;
  mask_patt_58.24_108 = [vec_unpack_hi_expr] mask__27.22_105;
  vect_patt_59.25_114 = MASK_LOAD (vectp.26_112, 8B, mask_patt_58.24_107);
  _47 = ivtmp.52_18 + 32;
  _46 = (vector(4) real(kind=8) *) _47;
  vect_patt_59.25_116 = MASK_LOAD (_46, 8B, mask_patt_58.24_108);
  vect_patt_61.29_124 = MASK_LOAD (vectp.30_122, 8B, mask_patt_58.24_107);
  _48 = ivtmp.55_45 + 32;
  _49 = (vector(4) real(kind=8) *) _48;
  vect_patt_61.29_126 = MASK_LOAD (_49, 8B, mask_patt_58.24_108);
  vect__32.32_127 = vect_patt_59.25_114 + vect_patt_61.29_124;
  vect__32.32_128 = vect_patt_59.25_116 + vect_patt_61.29_126;
  vect_sum_33.33_129 = VEC_PACK_TRUNC_EXPR <vect__32.32_127, vect__32.32_128>;
  vect__43.34_130 = [vec_unpack_lo_expr] vect_sum_33.33_129;
  vect__43.34_131 = [vec_unpack_hi_expr] vect_sum_33.33_129;
  vect_patt_63.36_135 = VEC_COND_EXPR <mask_patt_58.24_107, vect__43.34_130, { 0.0, 0.0, 0.0, 0.0 }>;
  vect_patt_63.36_136 = VEC_COND_EXPR <mask_patt_58.24_108, vect__43.34_131, { 0.0, 0.0, 0.0, 0.0 }>;
  _62 = (void *) ivtmp.57_51;
  MEM[base: _62, offset: 0B] = vect_patt_63.36_135;
  MEM[base: _62, offset: 32B] = vect_patt_63.36_136;
  ivtmp.51_15 = ivtmp.51_57 + 1;
  ivtmp.52_42 = ivtmp.52_18 + 64;
  ivtmp.55_16 = ivtmp.55_45 + 64;
  ivtmp.57_52 = ivtmp.57_51 + 64;
  if (ivtmp.51_15 >= bnd.16_65)
    goto <bb 11>;

so I suspect a backend / RTL optimization issue.

Confirmed at least.  Bisection would be nice.
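
For anyone comparing a miscompiled binary against expected results, here is a
rough scalar C model of what the if-converted loop in the first dump computes.
It is only a sketch read off the GIMPLE, not the original Fortran testcase:
the function names, the signature and the choice of 0.0 for a masked-off load
are my assumptions (the select afterwards discards that value anyway).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MASK_LOAD: the pointer is dereferenced only
   when the mask is set.  Returning 0.0 for an inactive lane is a modelling
   assumption; the value is never used because of the select below. */
static double masked_load(const double *p, bool mask)
{
    return mask ? *p : 0.0;
}

/* Scalar model of the if-converted inner loop (names follow the dump:
   jc ~ jc_24, bound ~ _39, j ~ j_5, ij ~ ij_3).  Returns the updated ij.
   The GIMPLE loop is in do-while form, so this assumes j >= 1. */
size_t if_converted_row(const double *in1, const double *in2, double *out,
                        size_t ij, int j, int bound, bool jc)
{
    for (int i = 1; i <= j; i++, ij++) {
        bool mask = jc && (i <= bound);          /* _27 = jc_24 & ic_26 */
        double a = masked_load(&in1[ij], mask);  /* _29 */
        double b = masked_load(&in2[ij], mask);  /* _31 */
        float sum = (float)(a + b);              /* sum_33, real(kind=4) */
        out[ij] = mask ? (double)sum : 0.0;      /* prephitmp_41 */
    }
    return ij;
}
```

Running this next to the vectorized code on the same inputs should show which
elements differ; note that the store to out is unconditional, so elements
beyond the bound are expected to be 0.0 rather than left untouched.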