------- Comment #1 from rguenth at gcc dot gnu dot org  2008-05-02 12:36 -------
With a = b(k,:) - c manually unrolled, the loop over k is unrolled by the early loop unrolling pass. This exposes the unvectorizable calls to sin/cos (or rather the complex temporaries introduced by the sincos pass) to the vectorizer, which then punts.
The early unroller at -O3 is just limited by the maximum final loop size and the trip count (400 and 8) and the unroller estimates

Loop 4 iterates 8 times.
  Loop size: 40
  Estimated size after unrolling: 216

SLP also doesn't handle vectorization of register operations but needs memory source and destination operands(?). Likewise SLP shouldn't be confused by unvectorizable data types?

On x86_64 you can reproduce the missed vectorization with -O3 -ffast-math.

<bb 7>:
  # ivtmp.40_261 = PHI <9(6), ivtmp.40_240(8)>
  # sum1_5 = PHI <0.0(6), sum1_90(8)>
  # j_2 = PHI <1(6), j_94(8)>
  D.1032_55 = (real(kind=8)) j_2;
  D.1033_56 = D.1032_55 * 6.9813170079773179121929160828585736453533172607421875e-1;
  sincostmp.16_28 = __builtin_cexpi (D.1033_56);
  D.1034_57 = REALPART_EXPR <sincostmp.16_28>;
  D.1035_58 = sini_48 * D.1034_57;
  D.1036_59 = D.1035_58 * 5.0e-1;
  D.1037_62 = IMAGPART_EXPR <sincostmp.16_28>;
  D.1038_63 = sini_48 * D.1037_62;
  D.1039_64 = D.1038_63 * 5.0e-1;
  D.1044_128 = pretmp.30_150 - D.1036_59;
  D.1047_132 = pretmp.30_154 - D.1039_64;
  D.1052_137 = __builtin_pow (D.1044_128, 2.0e+0);
  D.1054_138 = __builtin_pow (D.1047_132, 2.0e+0);
  D.1044_149 = pretmp.30_168 - D.1036_59;
  D.1047_153 = pretmp.30_172 - D.1039_64;
  D.1052_158 = __builtin_pow (D.1044_149, 2.0e+0);
  D.1054_159 = __builtin_pow (D.1047_153, 2.0e+0);
  D.1044_170 = pretmp.30_188 - D.1036_59;
  D.1047_174 = pretmp.30_192 - D.1039_64;
  D.1052_179 = __builtin_pow (D.1044_170, 2.0e+0);
  D.1054_180 = __builtin_pow (D.1047_174, 2.0e+0);
  D.1044_191 = pretmp.30_206 - D.1036_59;
  D.1047_195 = pretmp.30_210 - D.1039_64;
  D.1052_200 = __builtin_pow (D.1044_191, 2.0e+0);
  D.1054_201 = __builtin_pow (D.1047_195, 2.0e+0);
  D.1044_212 = pretmp.30_218 - D.1036_59;
  D.1047_216 = pretmp.30_230 - D.1039_64;
  D.1052_221 = __builtin_pow (D.1044_212, 2.0e+0);
  D.1054_222 = __builtin_pow (D.1047_216, 2.0e+0);
  D.1044_233 = pretmp.30_238 - D.1036_59;
  D.1047_237 = pretmp.30_248 - D.1039_64;
  D.1052_242 = __builtin_pow (D.1044_233, 2.0e+0);
  D.1054_243 = __builtin_pow (D.1047_237, 2.0e+0);
  D.1044_254 = pretmp.30_256 - D.1036_59;
  D.1047_258 = pretmp.30_260 - D.1039_64;
  D.1052_263 = __builtin_pow (D.1044_254, 2.0e+0);
  D.1054_264 = __builtin_pow (D.1047_258, 2.0e+0);
  D.1044_275 = pretmp.30_276 - D.1036_59;
  D.1047_279 = pretmp.30_280 - D.1039_64;
  D.1052_284 = __builtin_pow (D.1044_275, 2.0e+0);
  D.1054_285 = __builtin_pow (D.1047_279, 2.0e+0);
  D.1044_71 = pretmp.30_68 - D.1036_59;
  D.1047_76 = pretmp.30_73 - D.1039_64;
  D.1052_83 = __builtin_pow (D.1044_71, 2.0e+0);
  D.1054_85 = __builtin_pow (D.1047_76, 2.0e+0);
  D.1055_86 = D.1054_85 + D.1052_83;
  dotp_89 = D.1055_86 + pretmp.33_294;
  dotp_288 = dotp_89 + D.1052_137;
  D.1055_286 = dotp_288 + D.1054_138;
  sum1_289 = pretmp.33_249 + D.1055_286;
  dotp_267 = sum1_289 + D.1052_158;
  D.1055_265 = dotp_267 + D.1054_159;
  sum1_268 = pretmp.33_228 + D.1055_265;
  dotp_246 = sum1_268 + D.1052_179;
  D.1055_244 = dotp_246 + D.1054_180;
  sum1_247 = pretmp.33_207 + D.1055_244;
  dotp_225 = sum1_247 + D.1052_200;
  D.1055_223 = dotp_225 + D.1054_201;
  sum1_226 = pretmp.33_186 + D.1055_223;
  dotp_204 = sum1_226 + D.1052_221;
  D.1055_202 = dotp_204 + D.1054_222;
  sum1_205 = pretmp.33_165 + D.1055_202;
  dotp_183 = sum1_205 + D.1052_242;
  D.1055_181 = dotp_183 + D.1054_243;
  sum1_184 = pretmp.33_144 + D.1055_181;
  dotp_162 = sum1_184 + D.1052_263;
  D.1055_160 = dotp_162 + D.1054_264;
  sum1_163 = pretmp.33_123 + D.1055_160;
  dotp_141 = sum1_163 + D.1052_284;
  D.1055_139 = dotp_141 + D.1054_285;
  sum1_142 = D.1055_139 + pretmp.33_292;
  sum1_90 = sum1_142 + sum1_5;
  j_94 = j_2 + 1;
  ivtmp.40_240 = ivtmp.40_261 - 1;
  if (ivtmp.40_240 == 0)
    goto <bb 9>;
  else
    goto <bb 8>;

--

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm dot com
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|0000-00-00 00:00:00         |2008-05-02 12:36:33
               date|                            |
            Summary|[4.4 Regression] early loop |[4.4 Regression] early loop
                   |unrolling pass prevents     |unrolling pass prevents
                   |vectorization               |vectorization, SLP doesn't
                   |                            |do its job
   Target Milestone|---                         |4.4.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36099