https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2019-01-25
                 CC|                            |hubicka at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
scalar costs for single iteration:
0x322a040 _1 * 4 1 times scalar_stmt costs 12 in body
0x322a040 *_3 1 times scalar_load costs 12 in body
0x322a040 _4 + r_16 1 times scalar_stmt costs 12 in body

single-iteration vector cost:
0x31651e0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x31651e0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x31651e0 _4 + r_16 8 times scalar_stmt costs 96 in body

There's the old issue that we use vec_to_scalar (originally meant to be used
only for the vector-to-scalar conversion in the reduction epilogue, and thus
"free" on x86_64 since you can simply use %xmm0 for element zero) also for
random element extraction.  Besides this there is the usual issue that even
when everything else stays scalar, the apparent savings from vectorizing the
load (12 * 8 scalar vs. 20 vector) offset quite a bit of the eventual extra
mangling (here the 8 vec_to_scalar operations).  Making vec_to_scalar cost
the same as a scalar load would offset those.  But then this makes the (few,
epilogue-only) really "free" vec_to_scalar ops expensive.  So

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 268257)
+++ gcc/config/i386/i386.c      (working copy)
@@ -45806,6 +45806,7 @@ ix86_builtin_vectorization_cost (enum ve
     case scalar_stmt:
       return fp ? ix86_cost->addss : COSTS_N_INSNS (1);

+    case vec_to_scalar:
     case scalar_load:
       /* load/store costs are relative to register move which is 2.
          Recompute it to COSTS_N_INSNS so everything have same base.  */
@@ -45834,7 +45835,6 @@ ix86_builtin_vectorization_cost (enum ve
          index = 2;
       return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;

-    case vec_to_scalar:
     case scalar_to_vec:
       return ix86_vec_cost (mode, ix86_cost->sse_op);

but as said this is a hack in the target (it needs to be benchmarked before
it can be considered).  The real issue is that we use both vec_to_scalar and
scalar_to_vec for different things that usually do not even have similar
costs.

Note that even with the above we still vectorize the loop, because in the
scalar costing we cost the address generation for the scalar load but not in
the vector case (another discrepancy...).  This happens because we detected
a pattern involving this:

t.c:1:35: note: vect_recog_widen_mult_pattern: detected: _2 = _1 * 4;
t.c:1:35: note: widen_mult pattern recognized: patt_7 = (long unsigned int)
patt_13;

I have a fix for that (testing separately).  With this fix alone we still
vectorize as well.
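For reference, the GIMPLE names in the cost dump (_2 = _1 * 4 as the
byte-offset computation for a 4-byte element, *_3 as the load, _4 + r_16 as
the running sum) suggest a plain summation loop along these lines.  This is
an assumption reconstructed from the dump, not the PR's actual testcase:

```c
/* Hypothetical reconstruction of the loop shape behind the cost dump:
   per scalar iteration the body computes i * 4 (address generation for
   an int element, the widen_mult pattern above), performs one load, and
   accumulates into r.  Not the testcase attached to the PR.  */
int
sum (const int *a, int n)
{
  int r = 0;
  for (int i = 0; i < n; i++)
    r += a[i];  /* one scalar_load plus one scalar_stmt per iteration */
  return r;
}
```

In the vector costing shown above, the body instead does a single
unaligned_load of eight elements and then eight vec_to_scalar extractions
followed by eight scalar adds.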