On Mon, Jun 11, 2012 at 8:37 PM, Richard Guenther
<richard.guent...@gmail.com> wrote:
> On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt
> <wschm...@linux.vnet.ibm.com> wrote:
>>
>>
>> On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
>>> On Mon, 11 Jun 2012, William J. Schmidt wrote:
>>>
>>> >
>>> >
>>> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
>>> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
>>> > > <wschm...@linux.vnet.ibm.com> wrote:
>>> > > > The fix for PR53331 caused a degradation to 187.facerec on
>>> > > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
>>> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
>>> > > > Bootstrapped and regtested on that platform with no new regressions.  
>>> > > > Ok
>>> > > > for trunk?
>>> > >
>>> > > Well, would the real cost not be subparts * scalar_to_vec plus
>>> > > subparts * vec_perm?
>>> > > At least vec_perm isn't the cost for building up a vector from N scalar 
>>> > > elements
>>> > > either (it might be close enough if subparts == 2).  What's the case
>>> > > with facerec
>>> > > here?  Does it have subparts == 2?
>>> >
>>> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
>>> > PowerPC, this requires two merge instructions and a permute instruction
>>> > to get the four 32-bit quantities into the right place in a 128-bit
>>> > register.  Currently this is modeled as a vec_perm in other parts of the
>>> > vectorizer per Ira's earlier patches, so I naively changed this to do
>>> > the same thing.
>>>
>>> I see.
>>>
>>> > The types of vectorizer instructions aren't documented, and I can't
>>> > infer much from the i386.c cost model, so I need a little education.
>>> > What semantics are represented by scalar_to_vec?
>>>
>>> It's a vector splat, thus x -> { x, x, x, ... }.  You can create
>>> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
>>> as VEC_PERM_EXPR, takes two input vectors).  That's by far not
>>> the most efficient way to build up such a vector of course (with AVX
>>> you could do one splat plus N - 1 inserts for example).  The cost
>>> is of course dependent on the number of vector elements, so a
>>> simple new enum vect_cost_for_stmt kind does not cover it but
>>> the target would have to look at the vector type passed and do
>>> some reasonable guess.
>>
>> Ah, splat!  Yes, that's lingo I understand.  I see the intent now.
>>
>>>
>>> > On PowerPC, we have this mapping of the floating-point registers and
>>> > vector float registers where they overlap (the low-order half of each of
>>> > the first 32 vector float regs corresponds to a scalar float reg).  So
>>> > in this case we have four scalar loads that place things in the bottom
>>> > half of four vector registers, two vector merge instructions that
>>> > collapse the four registers into two vector registers, and a vector
>>> > permute that gets things in the right order.(*)  I wonder if what we
>>> > refer to as a merge instruction is similar to scalar_to_vec.
>>>
>>> Looks similar to x86 SSE then.
>>>
>>> > If so, then maybe we need something like
>>> >
>>> >      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>> >      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
>>> >      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * 
>>> > subparts / 2;
>>> >      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
>>> >
>>> > But then we'd have to change how vec_perm is modeled elsewhere for
>>> > PowerPC based on Ira's earlier patches.  As I said, it's difficult for
>>> > me to figure out all the intent of cost model decisions that have been
>>> > made in the past, using current documentation.
>>>
>>> Heh, usually the intent was to make the changes simple, not to compute
>>> a proper cost.
>>>
>>> I think we simply need a new scalars_to_vec cost kind.
>>
>> That works.  Maybe vec_construct gets the point across a little better?
>> I think we need to use the full builtin_vectorization_cost interface
>> instead of vect_get_stmt_cost here, so the targets can parameterize on
>> type.  Then we can just do one cost calculation for vec_construct that
>> covers the full costs of getting the vector in order after the loads.
>
> Yes, vec_construct sounds good to me.
>
>>>
>>> > > I really wanted to pessimize this case
>>> > > for say AVX and char elements, thus building up a vector from 32 
>>> > > scalars which
>>> > > certainly does not cost a mere vec_perm.  So, maybe special-case the
>>> > > subparts == 2 case and assume vec_perm would match the cost only in that
>>> > > case.
>>> >
>>> > (I'm a little confused by this as what you have at the moment is a
>>> > single scalar_to_vec per copy, which has a cost of 1 on most i386
>>> > targets (occasionally 2).  The subparts multiplier is only applied to
>>> > the loads.  So changing this to vec_perm seemed to be a no-op for i386.)
>>>
>>> Oh, I somehow read the patch as you were removing the multiplication
>>> by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
>>> to reflect it with N scalar loads plus N splats plus N - 1 permutes
>>> originally.  You could also model it with N scalar loads plus N
>>> inserts (but we don't have a vec_insert cost either).  I think adding
>>> a scalars_to_vec or vec_init or however we want to call it - basically
>>> what the cost of a vector CONSTRUCTOR would be - is best.
>>>
>>> > (*) There are actually a couple more instructions here to convert 64-bit
>>> > values to 32-bit values, since on PowerPC 32-bit loads are converted to
>>> > 64-bit values in scalar float registers and they have to be coerced back
>>> > to 32-bit.  Very ugly.  The cost model currently doesn't represent this
>>> > at all, which I'll have to look at fixing at some point in some way that
>>> > isn't too nasty for the other targets.  The cost model for PowerPC seems
>>> > to need a lot of TLC.
>>>
>>> Maybe the above would be a possible way to do it.  On x86 for example
>>> a vector of two doubles is extremely cheap for generic SSE2 to construct,
>>> we can directly load into the low/high part, thus when we use it as
>>> N scalar loads plus the vector-build the vector-build would be free.
>>
>> Absolutely.  As long as the vector type is available, this can be built
>> into vec_construct's logic in the target.  I will feel much more
>> comfortable having better estimation for the ugly 32-bit case.
>
> Indeed.

Btw, with PR53533 I now have a case where multiplications of v4si are
really expensive on x86 without SSE 4.1.  But we only have vect_stmt_cost
and no further subdivision ...

Thus we'd need a tree_code argument to the cost hook.  Though it gets
quite overloaded then, so maybe splitting it into one handling loads/stores
(and get the misalign parameter) and one handling only vector_stmt but
with a tree_code argument.  Or splitting it even further, seeing
cond_branch_taken ...

Richard.

Reply via email to