On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt > <wschm...@linux.vnet.ibm.com> wrote: > > The fix for PR53331 caused a degradation to 187.facerec on > > powerpc64-unknown-linux-gnu. The following simple patch reverses the > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. > > Bootstrapped and regtested on that platform with no new regressions. Ok > > for trunk? > > Well, would the real cost not be subparts * scalar_to_vec plus > subparts * vec_perm? > At least vec_perm isn't the cost for building up a vector from N scalar > elements > either (it might be close enough if subparts == 2). What's the case > with facerec > here? Does it have subparts == 2?
In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On PowerPC, this requires two merge instructions and a permute instruction to get the four 32-bit quantities into the right place in a 128-bit register. Currently this is modeled as a vec_perm in other parts of the vectorizer per Ira's earlier patches, so I naively changed this to do the same thing. The types of vectorizer instructions aren't documented, and I can't infer much from the i386.c cost model, so I need a little education. What semantics are represented by scalar_to_vec? On PowerPC, we have this mapping of the floating-point registers and vector float registers where they overlap (the low-order half of each of the first 32 vector float regs corresponds to a scalar float reg). So in this case we have four scalar loads that place things in the bottom half of four vector registers, two vector merge instructions that collapse the four registers into two vector registers, and a vector permute that gets things in the right order.(*) I wonder if what we refer to as a merge instruction is similar to scalar_to_vec. If so, then maybe we need something like subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; inside_cost += ncopies * vect_get_stmt_cost (vec_perm); But then we'd have to change how vec_perm is modeled elsewhere for PowerPC based on Ira's earlier patches. As I said, it's difficult for me to figure out all the intent of cost model decisions that have been made in the past, using current documentation. > I really wanted to pessimize this case > for say AVX and char elements, thus building up a vector from 32 scalars which > certainly does not cost a mere vec_perm. So, maybe special-case the > subparts == 2 case and assume vec_perm would match the cost only in that > case. (I'm a little confused by this as what you have at the moment is a single scalar_to_vec per copy, which has a cost of 1 on most i386 targets (occasionally 2). The subparts multiplier is only applied to the loads. So changing this to vec_perm seemed to be a no-op for i386.) (*) There are actually a couple more instructions here to convert 64-bit values to 32-bit values, since on PowerPC 32-bit loads are converted to 64-bit values in scalar float registers and they have to be coerced back to 32-bit. Very ugly. The cost model currently doesn't represent this at all, which I'll have to look at fixing at some point in some way that isn't too nasty for the other targets. The cost model for PowerPC seems to need a lot of TLC. Thanks, Bill > > Thanks, > Richard. > > > Thanks, > > Bill > > > > > > 2012-06-10 Bill Schmidt <wschm...@linux.ibm.com> > > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model > > for strided loads. > > > > > > Index: gcc/tree-vect-stmts.c > > =================================================================== > > --- gcc/tree-vect-stmts.c (revision 188341) > > +++ gcc/tree-vect-stmts.c (working copy) > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int > > /* The loads themselves. */ > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > > { > > - /* N scalar loads plus gathering them into a vector. > > - ??? scalar_to_vec isn't the cost for that. */ > > + /* N scalar loads plus gathering them into a vector. */ > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE > > (stmt_info))); > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > } > > else > > vect_get_load_cost (first_dr, ncopies, > > > > >