http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031

Ira Rosen <irar at il dot ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm.com

--- Comment #2 from Ira Rosen <irar at il dot ibm.com> 2011-08-10 06:36:26 UTC ---
(In reply to comment #0)

> 
> 1) When tree-vect-data-refs doesn't know the alignment of memory in a loop
> that is vectorized, and the machine has a vec_realign_load_<type> pattern,
> the generated loop always uses the unaligned load, even though it might be
> slow.  On the Power7, the realign code uses a vector load and the lvsr
> instruction to create a permute mask, and then in the inner loop, after
> each load, uses the permute mask to do the unaligned loads.  Thus, in the
> loop, before doing the conversions, we will be doing 4 vector loads and 4
> permutes.  The vector conversion from 32-bit to 64-bit involves two more
> permutes to split the V4SF values into the appropriate registers before
> doing the float->double convert.  Thus in the loop we will have 4 permutes
> for the 4 loads that are done, and 8 permutes for the conversions.  The
> Power7 only has one permute functional unit, and multiple permutes can slow
> things down.  The code has one segment with 3 back-to-back permutes, and
> another with 6 back-to-back permutes.
> 
> If the vectorizer could clone the loop and on one side test whether the
> pointers are aligned, doing the aligned loads if they are, and on the other
> side do unaligned loads, it would help.  I experimented with an option to
> disable the vec_realign_load_<type> pattern, and it helped this particular
> benchmark, but hurt other benchmarks, because the code would do the
> vectorized loop only if the pointers were aligned, and fell back to a
> scalar loop if they were unaligned.  I would think falling back to using
> vec_realign_<xxx> would be a win.

Yes, this kind of versioning is a good idea. I have it implemented on the
st/cli-be-vect branch, but it would probably be hard to extract this patch
from there. I'll take a look.
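
To illustrate the versioning being discussed, here is a minimal sketch at the
source level (the function name, the aligned-destination and n-multiple-of-4
assumptions, and the loop structure are mine for illustration, not the code
GCC generates).  The fallback side uses the classic big-endian lvsl/vec_perm
realignment idiom instead of a scalar loop; GCC's
dr_explicit_realign_optimized scheme uses lvsr with swapped permute operands,
but the one-permute-per-load cost is the same:

#include <altivec.h>
#include <stdint.h>

/* dst is assumed 16-byte aligned and n a multiple of 4; only the
   alignment of src is unknown and tested at run time.  */
void
copy_f32 (float *restrict dst, const float *restrict src, long n)
{
  if (((uintptr_t) src & 15) == 0)
    {
      /* Aligned version: one aligned load per vector, no permutes.  */
      for (long i = 0; i < n; i += 4)
        vec_st (vec_ld (0, src + i), 0, dst + i);
    }
  else
    {
      /* Realign version.  vec_ld ignores the low four address bits
         (the Altivec behavior point 3 below refers to), so both loads
         are aligned; the permute mask is loop-invariant and hoisted.
         Note this may read up to 15 bytes past src[n-1].  */
      vector unsigned char mask = vec_lvsl (0, src);
      for (long i = 0; i < n; i += 4)
        {
          vector float lo = vec_ld (0, src + i);
          vector float hi = vec_ld (15, src + i);
          vec_st (vec_perm (lo, hi, mask), 0, dst + i);
        }
    }
}

Either way a vector loop runs; only the permute overhead differs, which is
exactly the trade-off described above.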

> 
> 2) In looking at the documentation, I discovered that vec_realign_<xxx> is not
> documented in md.texi.
> 
> 3) The powerpc backend doesn't realize it could use the Altivec load
> instruction to load unaligned memory (since the Altivec load implicitly
> ignores the bottom bits of the address).
> 
> 4) The code in tree-vect-stmts.c, tree-vect-slp.c, and tree-vect-loop.c
> that calls the vectorization cost target hook never passes the actual type
> in the vectype argument, or sets the misalign argument to non-zero.  I
> would imagine that vector systems might have different costs depending on
> the type.  Maybe the two arguments should be eliminated if we aren't going
> to pass useful information.  In addition, there is a cost for unaligned
> loads (via movmisalign), but there doesn't seem to be a cost for
> realignment (vec_realign).

We pass the type and the misalignment value in vect_get_store_cost and
vect_get_load_cost, the only places where they are relevant. It may well be
that the actual costs depend on the type, but the cost model is only an
estimate and is based on tuning, so I guess nobody has found this useful
until now.

The type and the misalignment value are important for VSX in the movmisalign
case, so the cost of a data access takes them into account in
rs6000_builtin_vectorization_cost.
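
The hook in question is targetm.vectorize.builtin_vectorization_cost.  As a
rough sketch of how a target can consult vectype and misalign (the shape
follows the VSX handling just described, but the function name and the
concrete numbers are illustrative rather than the real rs6000 tuning
values):

static int
example_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                    tree vectype, int misalign)
{
  switch (type_of_cost)
    {
    case unaligned_load:
      /* For a 4-element vector, a doubleword-aligned access
         (misalign == 8) is cheap; word-aligned (4, 12) or unknown
         (misalign == -1) is expensive.  */
      if (vectype && TYPE_VECTOR_SUBPARTS (vectype) == 4)
        return misalign == 8 ? 2 : 22;
      return 2;

    case vec_perm:
    case vector_stmt:
      /* Currently equal, which is why the fix below changes nothing
         by itself.  */
      return 1;

    default:
      return 1;
    }
}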

The cost of realignment is calculated in vect_get_load_cost under 'case
dr_explicit_realign' and 'case dr_explicit_realign_optimized'. I noticed
just now that it uses plain vector_stmt instead of vec_perm, so it should be
fixed like this:

Index: tree-vect-stmts.c
===================================================================
--- tree-vect-stmts.c   (revision 177586)
+++ tree-vect-stmts.c   (working copy)
@@ -1011,7 +1011,7 @@ vect_get_load_cost (struct data_referenc
     case dr_explicit_realign:
       {
         *inside_cost += ncopies * (2 * vect_get_stmt_cost (vector_load)
-           + vect_get_stmt_cost (vector_stmt));
+           + vect_get_stmt_cost (vec_perm));

         /* FIXME: If the misalignment remains fixed across the iterations of
            the containing loop, the following cost should be added to the
@@ -1042,7 +1042,7 @@ vect_get_load_cost (struct data_referenc
           }

         *inside_cost += ncopies * (vect_get_stmt_cost (vector_load)
-          + vect_get_stmt_cost (vector_stmt));
+          + vect_get_stmt_cost (vec_perm));
         break;
       }

but since these costs are currently equal (at least on rs6000), the fix will
not change anything unless the costs themselves are changed.
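
For the vec_perm fix to matter, a target would have to distinguish the two
costs.  For example, a hypothetical tweak to the sketch above, reflecting
the single permute unit mentioned in comment #0:

    case vec_perm:
      /* Hypothetical: permutes contend for the single permute unit,
         so charge them more than a generic vector statement.  */
      return 3;

    case vector_stmt:
      return 1;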

Ira
