Re: [PATCH] Correct cost model for strided loads

William J. Schmidt Mon, 11 Jun 2012 06:54:57 -0700


On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
> <[email protected]> wrote:
> > The fix for PR53331 caused a degradation to 187.facerec on
> > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > for trunk?
> 
> Well, would the real cost not be subparts * scalar_to_vec plus
> subparts * vec_perm?
> At least vec_perm isn't the cost for building up a vector from N scalar 
> elements
> either (it might be close enough if subparts == 2).  What's the case
> with facerec
> here?  Does it have subparts == 2?


In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
PowerPC, this requires two merge instructions and a permute instruction
to get the four 32-bit quantities into the right place in a 128-bit
register.  Currently this is modeled as a vec_perm in other parts of the
vectorizer per Ira's earlier patches, so I naively changed this to do
the same thing.

The types of vectorizer instructions aren't documented, and I can't
infer much from the i386.c cost model, so I need a little education.
What semantics are represented by scalar_to_vec?

On PowerPC, we have this mapping of the floating-point registers and
vector float registers where they overlap (the low-order half of each of
the first 32 vector float regs corresponds to a scalar float reg).  So
in this case we have four scalar loads that place things in the bottom
half of four vector registers, two vector merge instructions that
collapse the four registers into two vector registers, and a vector
permute that gets things in the right order.(*)  I wonder if what we
refer to as a merge instruction is similar to scalar_to_vec.

If so, then maybe we need something like

     subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
     inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
     inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
     inside_cost += ncopies * vect_get_stmt_cost (vec_perm);

But then we'd have to change how vec_perm is modeled elsewhere for
PowerPC based on Ira's earlier patches.  As I said, it's difficult for
me to figure out all the intent of cost model decisions that have been
made in the past, using current documentation.

> I really wanted to pessimize this case
> for say AVX and char elements, thus building up a vector from 32 scalars which
> certainly does not cost a mere vec_perm.  So, maybe special-case the
> subparts == 2 case and assume vec_perm would match the cost only in that
> case.

(I'm a little confused by this as what you have at the moment is a
single scalar_to_vec per copy, which has a cost of 1 on most i386
targets (occasionally 2).  The subparts multiplier is only applied to
the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

(*) There are actually a couple more instructions here to convert 64-bit
values to 32-bit values, since on PowerPC 32-bit loads are converted to
64-bit values in scalar float registers and they have to be coerced back
to 32-bit.  Very ugly.  The cost model currently doesn't represent this
at all, which I'll have to look at fixing at some point in some way that
isn't too nasty for the other targets.  The cost model for PowerPC seems
to need a lot of TLC.

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> >
> >
> > 2012-06-10  Bill Schmidt  <[email protected]>
> >
> >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
> >        for strided loads.
> >
> >
> > Index: gcc/tree-vect-stmts.c
> > ===================================================================
> > --- gcc/tree-vect-stmts.c       (revision 188341)
> > +++ gcc/tree-vect-stmts.c       (working copy)
> > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> >   /* The loads themselves.  */
> >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> >     {
> > -      /* N scalar loads plus gathering them into a vector.
> > -         ???  scalar_to_vec isn't the cost for that.  */
> > +      /* N scalar loads plus gathering them into a vector.  */
> >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
> >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE 
> > (stmt_info)));
> > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> >     }
> >   else
> >     vect_get_load_cost (first_dr, ncopies,
> >
> >
>

Re: [PATCH] Correct cost model for strided loads

Reply via email to