------- Comment #1 from rguenth at gcc dot gnu dot org  2008-05-02 12:36 -------
With a = b(k,:) - c manually unrolled, the loop over k is unrolled by the
early loop unrolling pass. This exposes the unvectorizable calls to sin/cos
(or rather the complex temporaries introduced by the sincos pass) to the
vectorizer, which then punts.

The early unroller at -O3 is limited only by the maximum final loop size and
the maximum trip count (400 and 8), and for this loop the unroller estimates

Loop 4 iterates 8 times.
  Loop size: 40
  Estimated size after unrolling: 216
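
The check implied by these numbers can be sketched as follows. This is a
simplified illustration, not GCC's actual code: the struct, the function
name and the use of <= for both comparisons are assumptions. Only the values
400, 8 and 216 come from this report.

```c
#include <stdbool.h>

/* Hypothetical container for the two limits mentioned above.  */
struct unroll_limits
{
  unsigned max_final_size;   /* 400 in this report */
  unsigned max_trip_count;   /* 8 in this report */
};

/* Return true if complete unrolling stays within both limits.
   Assumption: both comparisons are inclusive.  */
static bool within_unroll_limits (unsigned estimated_unrolled_size,
                                  unsigned trip_count,
                                  struct unroll_limits lim)
{
  return estimated_unrolled_size <= lim.max_final_size
         && trip_count <= lim.max_trip_count;
}
```

With an estimated size of 216 and a trip count of 8, both limits are met, so
under these assumptions nothing stops the early unroller from completely
unrolling loop 4.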

SLP also doesn't seem to handle vectorization of operations on registers; it
appears to need memory source and destination operands(?).  Likewise, SLP
shouldn't be confused by unvectorizable data types, should it?
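
The register-versus-memory distinction in question can be sketched in C as
below. Both functions and their names are invented for illustration, and
whether SLP really accepts only the second form is exactly the open question
above, not something this sketch establishes.

```c
/* Register form: isomorphic statements on scalar values only, similar to
   the unrolled statements in the dump (no loads or stores involved).  */
static double sumsq_registers (double b0, double b1, double b2, double b3,
                               double c)
{
  double d0 = b0 - c;
  double d1 = b1 - c;
  double d2 = b2 - c;
  double d3 = b3 - c;
  return d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3;
}

/* Memory form: the same subtractions, but with adjacent loads from b[]
   and adjacent stores to a[].  */
static void diff_memory (const double *b, double c, double *a)
{
  a[0] = b[0] - c;
  a[1] = b[1] - c;
  a[2] = b[2] - c;
  a[3] = b[3] - c;
}
```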

On x86_64 you can reproduce the missed vectorization with -O3 -ffast-math.

<bb 7>:
  # ivtmp.40_261 = PHI <9(6), ivtmp.40_240(8)>
  # sum1_5 = PHI <0.0(6), sum1_90(8)>
  # j_2 = PHI <1(6), j_94(8)>
  D.1032_55 = (real(kind=8)) j_2;
  D.1033_56 = D.1032_55 * 6.9813170079773179121929160828585736453533172607421875e-1;
  sincostmp.16_28 = __builtin_cexpi (D.1033_56);
  D.1034_57 = REALPART_EXPR <sincostmp.16_28>;
  D.1035_58 = sini_48 * D.1034_57;
  D.1036_59 = D.1035_58 * 5.0e-1;
  D.1037_62 = IMAGPART_EXPR <sincostmp.16_28>;
  D.1038_63 = sini_48 * D.1037_62;
  D.1039_64 = D.1038_63 * 5.0e-1;
  D.1044_128 = pretmp.30_150 - D.1036_59;
  D.1047_132 = pretmp.30_154 - D.1039_64;
  D.1052_137 = __builtin_pow (D.1044_128, 2.0e+0);
  D.1054_138 = __builtin_pow (D.1047_132, 2.0e+0);
  D.1044_149 = pretmp.30_168 - D.1036_59;
  D.1047_153 = pretmp.30_172 - D.1039_64;
  D.1052_158 = __builtin_pow (D.1044_149, 2.0e+0);
  D.1054_159 = __builtin_pow (D.1047_153, 2.0e+0);
  D.1044_170 = pretmp.30_188 - D.1036_59;
  D.1047_174 = pretmp.30_192 - D.1039_64;
  D.1052_179 = __builtin_pow (D.1044_170, 2.0e+0);
  D.1054_180 = __builtin_pow (D.1047_174, 2.0e+0);
  D.1044_191 = pretmp.30_206 - D.1036_59;
  D.1047_195 = pretmp.30_210 - D.1039_64;
  D.1052_200 = __builtin_pow (D.1044_191, 2.0e+0);
  D.1054_201 = __builtin_pow (D.1047_195, 2.0e+0);
  D.1044_212 = pretmp.30_218 - D.1036_59;
  D.1047_216 = pretmp.30_230 - D.1039_64;
  D.1052_221 = __builtin_pow (D.1044_212, 2.0e+0);
  D.1054_222 = __builtin_pow (D.1047_216, 2.0e+0);
  D.1044_233 = pretmp.30_238 - D.1036_59;
  D.1047_237 = pretmp.30_248 - D.1039_64;
  D.1052_242 = __builtin_pow (D.1044_233, 2.0e+0);
  D.1054_243 = __builtin_pow (D.1047_237, 2.0e+0);
  D.1044_254 = pretmp.30_256 - D.1036_59;
  D.1047_258 = pretmp.30_260 - D.1039_64;
  D.1052_263 = __builtin_pow (D.1044_254, 2.0e+0);
  D.1054_264 = __builtin_pow (D.1047_258, 2.0e+0);
  D.1044_275 = pretmp.30_276 - D.1036_59;
  D.1047_279 = pretmp.30_280 - D.1039_64;
  D.1052_284 = __builtin_pow (D.1044_275, 2.0e+0);
  D.1054_285 = __builtin_pow (D.1047_279, 2.0e+0);
  D.1044_71 = pretmp.30_68 - D.1036_59;
  D.1047_76 = pretmp.30_73 - D.1039_64;
  D.1052_83 = __builtin_pow (D.1044_71, 2.0e+0);
  D.1054_85 = __builtin_pow (D.1047_76, 2.0e+0);
  D.1055_86 = D.1054_85 + D.1052_83;
  dotp_89 = D.1055_86 + pretmp.33_294;
  dotp_288 = dotp_89 + D.1052_137;
  D.1055_286 = dotp_288 + D.1054_138;
  sum1_289 = pretmp.33_249 + D.1055_286;
  dotp_267 = sum1_289 + D.1052_158;
  D.1055_265 = dotp_267 + D.1054_159;
  sum1_268 = pretmp.33_228 + D.1055_265;
  dotp_246 = sum1_268 + D.1052_179;
  D.1055_244 = dotp_246 + D.1054_180;
  sum1_247 = pretmp.33_207 + D.1055_244;
  dotp_225 = sum1_247 + D.1052_200;
  D.1055_223 = dotp_225 + D.1054_201;
  sum1_226 = pretmp.33_186 + D.1055_223;
  dotp_204 = sum1_226 + D.1052_221;
  D.1055_202 = dotp_204 + D.1054_222;
  sum1_205 = pretmp.33_165 + D.1055_202;
  dotp_183 = sum1_205 + D.1052_242;
  D.1055_181 = dotp_183 + D.1054_243;
  sum1_184 = pretmp.33_144 + D.1055_181;
  dotp_162 = sum1_184 + D.1052_263;
  D.1055_160 = dotp_162 + D.1054_264;
  sum1_163 = pretmp.33_123 + D.1055_160;
  dotp_141 = sum1_163 + D.1052_284;
  D.1055_139 = dotp_141 + D.1054_285;
  sum1_142 = D.1055_139 + pretmp.33_292;
  sum1_90 = sum1_142 + sum1_5;
  j_94 = j_2 + 1;
  ivtmp.40_240 = ivtmp.40_261 - 1;
  if (ivtmp.40_240 == 0)
    goto <bb 9>;
  else
    goto <bb 8>;


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm dot com
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|0000-00-00 00:00:00         |2008-05-02 12:36:33
               date|                            |
            Summary|[4.4 Regression] early loop |[4.4 Regression] early loop
                   |unrolling pass prevents     |unrolling pass prevents
                   |vectorization               |vectorization, SLP doesn't
                   |                            |do its job
   Target Milestone|---                         |4.4.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36099
