Re: [PATCH] Correct cost model for strided loads

2012-06-12 Thread Richard Guenther
On Mon, Jun 11, 2012 at 8:37 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt
 wschm...@linux.vnet.ibm.com wrote:


 On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
 On Mon, 11 Jun 2012, William J. Schmidt wrote:

 
 
  On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
   On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
   wschm...@linux.vnet.ibm.com wrote:
The fix for PR53331 caused a degradation to 187.facerec on
powerpc64-unknown-linux-gnu.  The following simple patch reverses the
degradation without otherwise affecting SPEC cpu2000 or cpu2006.
Bootstrapped and regtested on that platform with no new regressions.  
Ok
for trunk?
  
   Well, would the real cost not be subparts * scalar_to_vec plus
   subparts * vec_perm?
   At least vec_perm isn't the cost for building up a vector from N scalar 
   elements
   either (it might be close enough if subparts == 2).  What's the case
   with facerec
   here?  Does it have subparts == 2?
 
  In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
  PowerPC, this requires two merge instructions and a permute instruction
  to get the four 32-bit quantities into the right place in a 128-bit
  register.  Currently this is modeled as a vec_perm in other parts of the
  vectorizer per Ira's earlier patches, so I naively changed this to do
  the same thing.

 I see.

  The types of vectorizer instructions aren't documented, and I can't
  infer much from the i386.c cost model, so I need a little education.
  What semantics are represented by scalar_to_vec?

 It's a vector splat, thus x - { x, x, x, ... }.  You can create
 { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
 as VEC_PERM_EXPR, takes two input vectors).  That's by far not
 the most efficient way to build up such a vector of course (with AVX
 you could do one splat plus N - 1 inserts for example).  The cost
 is of course dependent on the number of vector elements, so a
 simple new enum vect_cost_for_stmt kind does not cover it but
 the target would have to look at the vector type passed and do
 some reasonable guess.

 Ah, splat!  Yes, that's lingo I understand.  I see the intent now.


  On PowerPC, we have this mapping of the floating-point registers and
  vector float registers where they overlap (the low-order half of each of
  the first 32 vector float regs corresponds to a scalar float reg).  So
  in this case we have four scalar loads that place things in the bottom
  half of four vector registers, two vector merge instructions that
  collapse the four registers into two vector registers, and a vector
  permute that gets things in the right order.(*)  I wonder if what we
  refer to as a merge instruction is similar to scalar_to_vec.

 Looks similar to x86 SSE then.

  If so, then maybe we need something like
 
       subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
       inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
       inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * 
  subparts / 2;
       inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
 
  But then we'd have to change how vec_perm is modeled elsewhere for
  PowerPC based on Ira's earlier patches.  As I said, it's difficult for
  me to figure out all the intent of cost model decisions that have been
  made in the past, using current documentation.

 Heh, usually the intent was to make the changes simple, not to compute
 a proper cost.

 I think we simply need a new scalars_to_vec cost kind.

 That works.  Maybe vec_construct gets the point across a little better?
 I think we need to use the full builtin_vectorization_cost interface
 instead of vect_get_stmt_cost here, so the targets can parameterize on
 type.  Then we can just do one cost calculation for vec_construct that
 covers the full costs of getting the vector in order after the loads.

 Yes, vec_construct sounds good to me.


   I really wanted to pessimize this case
   for say AVX and char elements, thus building up a vector from 32 
   scalars which
   certainly does not cost a mere vec_perm.  So, maybe special-case the
   subparts == 2 case and assume vec_perm would match the cost only in that
   case.
 
  (I'm a little confused by this as what you have at the moment is a
  single scalar_to_vec per copy, which has a cost of 1 on most i386
  targets (occasionally 2).  The subparts multiplier is only applied to
  the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

 Oh, I somehow read the patch as you were removing the multiplication
 by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
 to reflect it with N scalar loads plus N splats plus N - 1 permutes
 originally.  You could also model it with N scalar loads plus N
 inserts (but we don't have a vec_insert cost either).  I think adding
 a scalars_to_vec or vec_init or however we want to call it - 

Re: [PATCH] Correct cost model for strided loads

2012-06-12 Thread William J. Schmidt
On Tue, 2012-06-12 at 12:59 +0200, Richard Guenther wrote:

 Btw, with PR53533 I now have a case where multiplications of v4si are
 really expensive on x86 without SSE 4.1.  But we only have vect_stmt_cost
 and no further subdivision ...
 
 Thus we'd need a tree_code argument to the cost hook.  Though it gets
 quite overloaded then, so maybe splitting it into one handling loads/stores
 (and get the misalign parameter) and one handling only vector_stmt but
 with a tree_code argument.  Or splitting it even further, seeing
 cond_branch_taken ...

Yes, I think subdividing the hook for the vector_stmt kind is pretty
much inevitable -- more situations like this expensive multiply will
arise.  I agree with the interface starting to get messy also.
Splitting it is probably the way to go -- a little painful but keeping
it all in one hook is going to get ugly.

Bill

 
 Richard.
 



Re: [PATCH] Correct cost model for strided loads

2012-06-11 Thread Richard Guenther
On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
wschm...@linux.vnet.ibm.com wrote:
 The fix for PR53331 caused a degradation to 187.facerec on
 powerpc64-unknown-linux-gnu.  The following simple patch reverses the
 degradation without otherwise affecting SPEC cpu2000 or cpu2006.
 Bootstrapped and regtested on that platform with no new regressions.  Ok
 for trunk?

Well, would the real cost not be subparts * scalar_to_vec plus
subparts * vec_perm?
At least vec_perm isn't the cost for building up a vector from N scalar elements
either (it might be close enough if subparts == 2).  What's the case
with facerec
here?  Does it have subparts == 2?  I really wanted to pessimize this case
for say AVX and char elements, thus building up a vector from 32 scalars which
certainly does not cost a mere vec_perm.  So, maybe special-case the
subparts == 2 case and assume vec_perm would match the cost only in that
case.

Thanks,
Richard.

 Thanks,
 Bill


 2012-06-10  Bill Schmidt  wschm...@linux.ibm.com

        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
        for strided loads.


 Index: gcc/tree-vect-stmts.c
 ===
 --- gcc/tree-vect-stmts.c       (revision 188341)
 +++ gcc/tree-vect-stmts.c       (working copy)
 @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
   /* The loads themselves.  */
   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
     {
 -      /* N scalar loads plus gathering them into a vector.
 -         ???  scalar_to_vec isn't the cost for that.  */
 +      /* N scalar loads plus gathering them into a vector.  */
       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
 -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
 +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
     }
   else
     vect_get_load_cost (first_dr, ncopies,




Re: [PATCH] Correct cost model for strided loads

2012-06-11 Thread William J. Schmidt


On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
 On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
 wschm...@linux.vnet.ibm.com wrote:
  The fix for PR53331 caused a degradation to 187.facerec on
  powerpc64-unknown-linux-gnu.  The following simple patch reverses the
  degradation without otherwise affecting SPEC cpu2000 or cpu2006.
  Bootstrapped and regtested on that platform with no new regressions.  Ok
  for trunk?
 
 Well, would the real cost not be subparts * scalar_to_vec plus
 subparts * vec_perm?
 At least vec_perm isn't the cost for building up a vector from N scalar 
 elements
 either (it might be close enough if subparts == 2).  What's the case
 with facerec
 here?  Does it have subparts == 2?  

In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
PowerPC, this requires two merge instructions and a permute instruction
to get the four 32-bit quantities into the right place in a 128-bit
register.  Currently this is modeled as a vec_perm in other parts of the
vectorizer per Ira's earlier patches, so I naively changed this to do
the same thing.

The types of vectorizer instructions aren't documented, and I can't
infer much from the i386.c cost model, so I need a little education.
What semantics are represented by scalar_to_vec?

On PowerPC, we have this mapping of the floating-point registers and
vector float registers where they overlap (the low-order half of each of
the first 32 vector float regs corresponds to a scalar float reg).  So
in this case we have four scalar loads that place things in the bottom
half of four vector registers, two vector merge instructions that
collapse the four registers into two vector registers, and a vector
permute that gets things in the right order.(*)  I wonder if what we
refer to as a merge instruction is similar to scalar_to_vec.

If so, then maybe we need something like

 subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
 inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
 inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
 inside_cost += ncopies * vect_get_stmt_cost (vec_perm);

But then we'd have to change how vec_perm is modeled elsewhere for
PowerPC based on Ira's earlier patches.  As I said, it's difficult for
me to figure out all the intent of cost model decisions that have been
made in the past, using current documentation.

 I really wanted to pessimize this case
 for say AVX and char elements, thus building up a vector from 32 scalars which
 certainly does not cost a mere vec_perm.  So, maybe special-case the
 subparts == 2 case and assume vec_perm would match the cost only in that
 case.

(I'm a little confused by this as what you have at the moment is a
single scalar_to_vec per copy, which has a cost of 1 on most i386
targets (occasionally 2).  The subparts multiplier is only applied to
the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

(*) There are actually a couple more instructions here to convert 64-bit
values to 32-bit values, since on PowerPC 32-bit loads are converted to
64-bit values in scalar float registers and they have to be coerced back
to 32-bit.  Very ugly.  The cost model currently doesn't represent this
at all, which I'll have to look at fixing at some point in some way that
isn't too nasty for the other targets.  The cost model for PowerPC seems
to need a lot of TLC.

Thanks,
Bill

 
 Thanks,
 Richard.
 
  Thanks,
  Bill
 
 
  2012-06-10  Bill Schmidt  wschm...@linux.ibm.com
 
 * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
 for strided loads.
 
 
  Index: gcc/tree-vect-stmts.c
  ===
  --- gcc/tree-vect-stmts.c   (revision 188341)
  +++ gcc/tree-vect-stmts.c   (working copy)
  @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
/* The loads themselves.  */
if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
  {
  -  /* N scalar loads plus gathering them into a vector.
  - ???  scalar_to_vec isn't the cost for that.  */
  +  /* N scalar loads plus gathering them into a vector.  */
inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
   * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE 
  (stmt_info)));
  -  inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
  +  inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
  }
else
  vect_get_load_cost (first_dr, ncopies,
 
 
 



Re: [PATCH] Correct cost model for strided loads

2012-06-11 Thread Richard Guenther
On Mon, 11 Jun 2012, William J. Schmidt wrote:

 
 
 On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
  On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
  wschm...@linux.vnet.ibm.com wrote:
   The fix for PR53331 caused a degradation to 187.facerec on
   powerpc64-unknown-linux-gnu.  The following simple patch reverses the
   degradation without otherwise affecting SPEC cpu2000 or cpu2006.
   Bootstrapped and regtested on that platform with no new regressions.  Ok
   for trunk?
  
  Well, would the real cost not be subparts * scalar_to_vec plus
  subparts * vec_perm?
  At least vec_perm isn't the cost for building up a vector from N scalar 
  elements
  either (it might be close enough if subparts == 2).  What's the case
  with facerec
  here?  Does it have subparts == 2?  
 
 In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
 PowerPC, this requires two merge instructions and a permute instruction
 to get the four 32-bit quantities into the right place in a 128-bit
 register.  Currently this is modeled as a vec_perm in other parts of the
 vectorizer per Ira's earlier patches, so I naively changed this to do
 the same thing.

I see.

 The types of vectorizer instructions aren't documented, and I can't
 infer much from the i386.c cost model, so I need a little education.
 What semantics are represented by scalar_to_vec?

It's a vector splat, thus x - { x, x, x, ... }.  You can create
{ x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
as VEC_PERM_EXPR, takes two input vectors).  That's by far not
the most efficient way to build up such a vector of course (with AVX
you could do one splat plus N - 1 inserts for example).  The cost
is of course dependent on the number of vector elements, so a
simple new enum vect_cost_for_stmt kind does not cover it but
the target would have to look at the vector type passed and do
some reasonable guess.

 On PowerPC, we have this mapping of the floating-point registers and
 vector float registers where they overlap (the low-order half of each of
 the first 32 vector float regs corresponds to a scalar float reg).  So
 in this case we have four scalar loads that place things in the bottom
 half of four vector registers, two vector merge instructions that
 collapse the four registers into two vector registers, and a vector
 permute that gets things in the right order.(*)  I wonder if what we
 refer to as a merge instruction is similar to scalar_to_vec.

Looks similar to x86 SSE then.

 If so, then maybe we need something like
 
  subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
  inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
  inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 
 2;
  inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
 
 But then we'd have to change how vec_perm is modeled elsewhere for
 PowerPC based on Ira's earlier patches.  As I said, it's difficult for
 me to figure out all the intent of cost model decisions that have been
 made in the past, using current documentation.

Heh, usually the intent was to make the changes simple, not to compute
a proper cost.

I think we simply need a new scalars_to_vec cost kind.

  I really wanted to pessimize this case
  for say AVX and char elements, thus building up a vector from 32 scalars 
  which
  certainly does not cost a mere vec_perm.  So, maybe special-case the
  subparts == 2 case and assume vec_perm would match the cost only in that
  case.
 
 (I'm a little confused by this as what you have at the moment is a
 single scalar_to_vec per copy, which has a cost of 1 on most i386
 targets (occasionally 2).  The subparts multiplier is only applied to
 the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

Oh, I somehow read the patch as you were removing the multiplication
by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
to reflect it with N scalar loads plus N splats plus N - 1 permutes
originally.  You could also model it with N scalar loads plus N
inserts (but we don't have a vec_insert cost either).  I think adding
a scalars_to_vec or vec_init or however we want to call it - basically
what the cost of a vector CONSTRUCTOR would be - is best.

 (*) There are actually a couple more instructions here to convert 64-bit
 values to 32-bit values, since on PowerPC 32-bit loads are converted to
 64-bit values in scalar float registers and they have to be coerced back
 to 32-bit.  Very ugly.  The cost model currently doesn't represent this
 at all, which I'll have to look at fixing at some point in some way that
 isn't too nasty for the other targets.  The cost model for PowerPC seems
 to need a lot of TLC.

Maybe the above would be a possible way to do it.  On x86 for example
a vector of two doubles is extremely cheap for generic SSE2 to construct,
we can directly load into the low/high part, thus when we use it as
N scalar loads plus the vector-build the 

Re: [PATCH] Correct cost model for strided loads

2012-06-11 Thread William J. Schmidt


On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
 On Mon, 11 Jun 2012, William J. Schmidt wrote:
 
  
  
  On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
   On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
   wschm...@linux.vnet.ibm.com wrote:
The fix for PR53331 caused a degradation to 187.facerec on
powerpc64-unknown-linux-gnu.  The following simple patch reverses the
degradation without otherwise affecting SPEC cpu2000 or cpu2006.
Bootstrapped and regtested on that platform with no new regressions.  Ok
for trunk?
   
   Well, would the real cost not be subparts * scalar_to_vec plus
   subparts * vec_perm?
   At least vec_perm isn't the cost for building up a vector from N scalar 
   elements
   either (it might be close enough if subparts == 2).  What's the case
   with facerec
   here?  Does it have subparts == 2?  
  
  In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
  PowerPC, this requires two merge instructions and a permute instruction
  to get the four 32-bit quantities into the right place in a 128-bit
  register.  Currently this is modeled as a vec_perm in other parts of the
  vectorizer per Ira's earlier patches, so I naively changed this to do
  the same thing.
 
 I see.
 
  The types of vectorizer instructions aren't documented, and I can't
  infer much from the i386.c cost model, so I need a little education.
  What semantics are represented by scalar_to_vec?
 
 It's a vector splat, thus x - { x, x, x, ... }.  You can create
 { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
 as VEC_PERM_EXPR, takes two input vectors).  That's by far not
 the most efficient way to build up such a vector of course (with AVX
 you could do one splat plus N - 1 inserts for example).  The cost
 is of course dependent on the number of vector elements, so a
 simple new enum vect_cost_for_stmt kind does not cover it but
 the target would have to look at the vector type passed and do
 some reasonable guess.

Ah, splat!  Yes, that's lingo I understand.  I see the intent now.

 
  On PowerPC, we have this mapping of the floating-point registers and
  vector float registers where they overlap (the low-order half of each of
  the first 32 vector float regs corresponds to a scalar float reg).  So
  in this case we have four scalar loads that place things in the bottom
  half of four vector registers, two vector merge instructions that
  collapse the four registers into two vector registers, and a vector
  permute that gets things in the right order.(*)  I wonder if what we
  refer to as a merge instruction is similar to scalar_to_vec.
 
 Looks similar to x86 SSE then.
 
  If so, then maybe we need something like
  
   subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
   inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
   inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts 
  / 2;
   inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
  
  But then we'd have to change how vec_perm is modeled elsewhere for
  PowerPC based on Ira's earlier patches.  As I said, it's difficult for
  me to figure out all the intent of cost model decisions that have been
  made in the past, using current documentation.
 
 Heh, usually the intent was to make the changes simple, not to compute
 a proper cost.
 
 I think we simply need a new scalars_to_vec cost kind.

That works.  Maybe vec_construct gets the point across a little better?
I think we need to use the full builtin_vectorization_cost interface
instead of vect_get_stmt_cost here, so the targets can parameterize on
type.  Then we can just do one cost calculation for vec_construct that
covers the full costs of getting the vector in order after the loads.

 
   I really wanted to pessimize this case
   for say AVX and char elements, thus building up a vector from 32 scalars 
   which
   certainly does not cost a mere vec_perm.  So, maybe special-case the
   subparts == 2 case and assume vec_perm would match the cost only in that
   case.
  
  (I'm a little confused by this as what you have at the moment is a
  single scalar_to_vec per copy, which has a cost of 1 on most i386
  targets (occasionally 2).  The subparts multiplier is only applied to
  the loads.  So changing this to vec_perm seemed to be a no-op for i386.)
 
 Oh, I somehow read the patch as you were removing the multiplication
 by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
 to reflect it with N scalar loads plus N splats plus N - 1 permutes
 originally.  You could also model it with N scalar loads plus N
 inserts (but we don't have a vec_insert cost either).  I think adding
 a scalars_to_vec or vec_init or however we want to call it - basically
 what the cost of a vector CONSTRUCTOR would be - is best.
 
  (*) There are actually a couple more instructions here to convert 64-bit
  values to 32-bit values, since on PowerPC 32-bit loads are 

Re: [PATCH] Correct cost model for strided loads

2012-06-11 Thread Richard Guenther
On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt
wschm...@linux.vnet.ibm.com wrote:


 On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
 On Mon, 11 Jun 2012, William J. Schmidt wrote:

 
 
  On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
   On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
   wschm...@linux.vnet.ibm.com wrote:
The fix for PR53331 caused a degradation to 187.facerec on
powerpc64-unknown-linux-gnu.  The following simple patch reverses the
degradation without otherwise affecting SPEC cpu2000 or cpu2006.
Bootstrapped and regtested on that platform with no new regressions.  
Ok
for trunk?
  
   Well, would the real cost not be subparts * scalar_to_vec plus
   subparts * vec_perm?
   At least vec_perm isn't the cost for building up a vector from N scalar 
   elements
   either (it might be close enough if subparts == 2).  What's the case
   with facerec
   here?  Does it have subparts == 2?
 
  In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
  PowerPC, this requires two merge instructions and a permute instruction
  to get the four 32-bit quantities into the right place in a 128-bit
  register.  Currently this is modeled as a vec_perm in other parts of the
  vectorizer per Ira's earlier patches, so I naively changed this to do
  the same thing.

 I see.

  The types of vectorizer instructions aren't documented, and I can't
  infer much from the i386.c cost model, so I need a little education.
  What semantics are represented by scalar_to_vec?

 It's a vector splat, thus x - { x, x, x, ... }.  You can create
 { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
 as VEC_PERM_EXPR, takes two input vectors).  That's by far not
 the most efficient way to build up such a vector of course (with AVX
 you could do one splat plus N - 1 inserts for example).  The cost
 is of course dependent on the number of vector elements, so a
 simple new enum vect_cost_for_stmt kind does not cover it but
 the target would have to look at the vector type passed and do
 some reasonable guess.

 Ah, splat!  Yes, that's lingo I understand.  I see the intent now.


  On PowerPC, we have this mapping of the floating-point registers and
  vector float registers where they overlap (the low-order half of each of
  the first 32 vector float regs corresponds to a scalar float reg).  So
  in this case we have four scalar loads that place things in the bottom
  half of four vector registers, two vector merge instructions that
  collapse the four registers into two vector registers, and a vector
  permute that gets things in the right order.(*)  I wonder if what we
  refer to as a merge instruction is similar to scalar_to_vec.

 Looks similar to x86 SSE then.

  If so, then maybe we need something like
 
       subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
       inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
       inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * 
  subparts / 2;
       inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
 
  But then we'd have to change how vec_perm is modeled elsewhere for
  PowerPC based on Ira's earlier patches.  As I said, it's difficult for
  me to figure out all the intent of cost model decisions that have been
  made in the past, using current documentation.

 Heh, usually the intent was to make the changes simple, not to compute
 a proper cost.

 I think we simply need a new scalars_to_vec cost kind.

 That works.  Maybe vec_construct gets the point across a little better?
 I think we need to use the full builtin_vectorization_cost interface
 instead of vect_get_stmt_cost here, so the targets can parameterize on
 type.  Then we can just do one cost calculation for vec_construct that
 covers the full costs of getting the vector in order after the loads.

Yes, vec_construct sounds good to me.


   I really wanted to pessimize this case
   for say AVX and char elements, thus building up a vector from 32 scalars 
   which
   certainly does not cost a mere vec_perm.  So, maybe special-case the
   subparts == 2 case and assume vec_perm would match the cost only in that
   case.
 
  (I'm a little confused by this as what you have at the moment is a
  single scalar_to_vec per copy, which has a cost of 1 on most i386
  targets (occasionally 2).  The subparts multiplier is only applied to
  the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

 Oh, I somehow read the patch as you were removing the multiplication
 by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
 to reflect it with N scalar loads plus N splats plus N - 1 permutes
 originally.  You could also model it with N scalar loads plus N
 inserts (but we don't have a vec_insert cost either).  I think adding
 a scalars_to_vec or vec_init or however we want to call it - basically
 what the cost of a vector CONSTRUCTOR would be - is best.

  (*) There are actually