On Mon, Jun 11, 2012 at 8:37 PM, Richard Guenther <richard.guent...@gmail.com> wrote: > On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt > <wschm...@linux.vnet.ibm.com> wrote: >> >> >> On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: >>> On Mon, 11 Jun 2012, William J. Schmidt wrote: >>> >>> > >>> > >>> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: >>> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt >>> > > <wschm...@linux.vnet.ibm.com> wrote: >>> > > > The fix for PR53331 caused a degradation to 187.facerec on >>> > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the >>> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. >>> > > > Bootstrapped and regtested on that platform with no new regressions. >>> > > > Ok >>> > > > for trunk? >>> > > >>> > > Well, would the real cost not be subparts * scalar_to_vec plus >>> > > subparts * vec_perm? >>> > > At least vec_perm isn't the cost for building up a vector from N scalar >>> > > elements >>> > > either (it might be close enough if subparts == 2). What's the case >>> > > with facerec >>> > > here? Does it have subparts == 2? >>> > >>> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On >>> > PowerPC, this requires two merge instructions and a permute instruction >>> > to get the four 32-bit quantities into the right place in a 128-bit >>> > register. Currently this is modeled as a vec_perm in other parts of the >>> > vectorizer per Ira's earlier patches, so I naively changed this to do >>> > the same thing. >>> >>> I see. >>> >>> > The types of vectorizer instructions aren't documented, and I can't >>> > infer much from the i386.c cost model, so I need a little education. >>> > What semantics are represented by scalar_to_vec? >>> >>> It's a vector splat, thus x -> { x, x, x, ... }. You can create >>> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, >>> as VEC_PERM_EXPR, takes two input vectors). That's by far not >>> the most efficient way to build up such a vector of course (with AVX >>> you could do one splat plus N - 1 inserts for example). The cost >>> is of course dependent on the number of vector elements, so a >>> simple new enum vect_cost_for_stmt kind does not cover it but >>> the target would have to look at the vector type passed and do >>> some reasonable guess. >> >> Ah, splat! Yes, that's lingo I understand. I see the intent now. >> >>> >>> > On PowerPC, we have this mapping of the floating-point registers and >>> > vector float registers where they overlap (the low-order half of each of >>> > the first 32 vector float regs corresponds to a scalar float reg). So >>> > in this case we have four scalar loads that place things in the bottom >>> > half of four vector registers, two vector merge instructions that >>> > collapse the four registers into two vector registers, and a vector >>> > permute that gets things in the right order.(*) I wonder if what we >>> > refer to as a merge instruction is similar to scalar_to_vec. >>> >>> Looks similar to x86 SSE then. >>> >>> > If so, then maybe we need something like >>> > >>> > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); >>> > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; >>> > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * >>> > subparts / 2; >>> > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); >>> > >>> > But then we'd have to change how vec_perm is modeled elsewhere for >>> > PowerPC based on Ira's earlier patches. As I said, it's difficult for >>> > me to figure out all the intent of cost model decisions that have been >>> > made in the past, using current documentation. >>> >>> Heh, usually the intent was to make the changes simple, not to compute >>> a proper cost. >>> >>> I think we simply need a new scalars_to_vec cost kind. >> >> That works. Maybe vec_construct gets the point across a little better? >> I think we need to use the full builtin_vectorization_cost interface >> instead of vect_get_stmt_cost here, so the targets can parameterize on >> type. Then we can just do one cost calculation for vec_construct that >> covers the full costs of getting the vector in order after the loads. > > Yes, vec_construct sounds good to me. > >>> >>> > > I really wanted to pessimize this case >>> > > for say AVX and char elements, thus building up a vector from 32 >>> > > scalars which >>> > > certainly does not cost a mere vec_perm. So, maybe special-case the >>> > > subparts == 2 case and assume vec_perm would match the cost only in that >>> > > case. >>> > >>> > (I'm a little confused by this as what you have at the moment is a >>> > single scalar_to_vec per copy, which has a cost of 1 on most i386 >>> > targets (occasionally 2). The subparts multiplier is only applied to >>> > the loads. So changing this to vec_perm seemed to be a no-op for i386.) >>> >>> Oh, I somehow read the patch as you were removing the multiplication >>> by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted >>> to reflect it with N scalar loads plus N splats plus N - 1 permutes >>> originally. You could also model it with N scalar loads plus N >>> inserts (but we don't have a vec_insert cost either). I think adding >>> a scalars_to_vec or vec_init or however we want to call it - basically >>> what the cost of a vector CONSTRUCTOR would be - is best. >>> >>> > (*) There are actually a couple more instructions here to convert 64-bit >>> > values to 32-bit values, since on PowerPC 32-bit loads are converted to >>> > 64-bit values in scalar float registers and they have to be coerced back >>> > to 32-bit. Very ugly. The cost model currently doesn't represent this >>> > at all, which I'll have to look at fixing at some point in some way that >>> > isn't too nasty for the other targets. The cost model for PowerPC seems >>> > to need a lot of TLC. >>> >>> Maybe the above would be a possible way to do it. On x86 for example >>> a vector of two doubles is extremely cheap for generic SSE2 to construct, >>> we can directly load into the low/high part, thus when we use it as >>> N scalar loads plus the vector-build the vector-build would be free. >> >> Absolutely. As long as the vector type is available, this can be built >> into vec_construct's logic in the target. I will feel much more >> comfortable having better estimation for the ugly 32-bit case. > > Indeed.
Btw, with PR53533 I now have a case where multiplications of v4si are really expensive on x86 without SSE 4.1. But we only have vect_stmt_cost and no further subdivision ... Thus we'd need a tree_code argument to the cost hook. Though it gets quite overloaded then, so maybe splitting it into one handling loads/stores (and get the misalign parameter) and one handling only vector_stmt but with a tree_code argument. Or splitting it even further, seeing cond_branch_taken ... Richard.