https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2019-01-25
                 CC|                            |hubicka at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
scalar costs for single iteration:

0x322a040 _1 * 4 1 times scalar_stmt costs 12 in body
0x322a040 *_3 1 times scalar_load costs 12 in body
0x322a040 _4 + r_16 1 times scalar_stmt costs 12 in body

single-iteration vector cost:

0x31651e0 *_3 1 times unaligned_load (misalign -1) costs 20 in body
0x31651e0 _4 + r_16 8 times vec_to_scalar costs 32 in body
0x31651e0 _4 + r_16 8 times scalar_stmt costs 96 in body

there's the old issue that we use vec_to_scalar also for random element
extraction, even though it was originally meant only for the vector-to-scalar
conversion in the reduction epilogue (and thus "free" on x86_64, since
element zero can simply be read from %xmm0).
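
To illustrate the asymmetry (a sketch with SSE intrinsics, not code from
this PR): reading element zero amounts to using %xmm0 directly, while any
other lane first needs a shuffle or extract instruction.

#include <immintrin.h>

/* Element zero: "free", the scalar already lives in the low lane of %xmm0.  */
float
extract_lane0 (__m128 v)
{
  return _mm_cvtss_f32 (v);
}

/* Any other lane: a real shuffle/extract instruction is needed first.  */
float
extract_lane3 (__m128 v)
{
  __m128 t = _mm_shuffle_ps (v, v, 3);   /* move lane 3 into lane 0 */
  return _mm_cvtss_f32 (t);
}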

Besides that there is the usual issue that, even if everything else stays
scalar, the apparent savings from vectorizing just the load (12 * 8 scalar
vs. 20 vector) offset quite a bit of the eventual extra mangling (here the
8 vec_to_scalar operations).  Making vec_to_scalar cost the same as a
scalar load would counter that.  But then the (few, epilogue-only) really
"free" vec_to_scalar ops become expensive.

So, something like:

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 268257)
+++ gcc/config/i386/i386.c      (working copy)
@@ -45806,6 +45806,7 @@ ix86_builtin_vectorization_cost (enum ve
       case scalar_stmt:
         return fp ? ix86_cost->addss : COSTS_N_INSNS (1);

+      case vec_to_scalar:
       case scalar_load:
        /* load/store costs are relative to register move which is 2. Recompute
           it to COSTS_N_INSNS so everything have same base.  */
@@ -45834,7 +45835,6 @@ ix86_builtin_vectorization_cost (enum ve
          index = 2;
         return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;

-      case vec_to_scalar:
       case scalar_to_vec:
         return ix86_vec_cost (mode, ix86_cost->sse_op);

but as said this is a hack in the target (it would need to be benchmarked
before being considered).  The real issue is that we use both vec_to_scalar
and scalar_to_vec for different things that usually do not have even
similar costs.

Note that even with the above we vectorize the loop because in the
scalar costing we cost the address-generation for the scalar load
but not in the vector case (another discrepancy...).  This happens
because we detected a pattern involving the address computation:

t.c:1:35: note:   vect_recog_widen_mult_pattern: detected: _2 = _1 * 4;
t.c:1:35: note:   widen_mult pattern recognized: patt_7 = (long unsigned int) patt_13;

I have a fix for that (testing separately).  With this fix alone we
still vectorize as well.
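
For reference, the loop shape suggested by the dumps (a reconstruction for
illustration, not necessarily the exact testcase of this PR) would be a
plain float sum whose adds have to stay in order (no -ffast-math), so the
vectorizer vectorizes only the load and then extracts every lane again for
the scalar adds, matching the 8 vec_to_scalar and 8 scalar_stmt entries
above:

/* Reconstructed for illustration; the real testcase may differ.  */
float
f (float *a)
{
  float r = 0.f;
  for (int i = 0; i < 8; ++i)
    r += a[i];
  return r;
}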
