Re: [PATCH GCC]A simple implementation of loop interchange

2017-11-23 Thread Bin.Cheng
Hi Richard,
Thanks for reviewing.  It's quite lot comment, I am trying to resolve
it one by one.  Here I have some questions as embedded.

On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener
 wrote:
> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng  wrote:
>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz  wrote:
>>
>>>
>>> I hope this is of some help to you :)
>> Thanks again, it's very helpful.
>>
>> I also fixed several bugs of previous implementation, mostly about debug info
>> statements and simple reductions.  As for test, I enabled this pass by 
>> default,
>> bootstrap and regtest GCC, I also build/run specs.  There must be some other
>> latent bugs in it, but guess we have to exercise it by enabling it at
>> some point.
>>
>> So any comments?
>
>  bool
> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool 
> insert_dbg)
>  {
>
> that you need this suggests you do stmt removal in wrong order (you need to
> do reverse dom order).
>
> +/* Maximum number of statements in loop nest for loop interchange.  */
> +
> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
> + "loop-interchange-max-num-stmts",
> + "The maximum number of stmts in loop nest for loop interchange.",
> + 64, 0, 0)
>
> is that to limit dependence computation?  In this case you should probably
> limit the number of data references instead?
No, this is to limit number of statements in loop.  We don't want to
do interchange for too large loops, right?

>
> +ftree-loop-interchange
> +Common Report Var(flag_tree_loop_interchange) Optimization
> +Enable loop interchange on trees.
> +
>
> please re-use -floop-interchange instead and change the GRAPHITE tests
> to use -floop-nest-optimize.  You can do that as pre-approved thing now.
>
> Please enable the pass by default at O3 via opts.c.
There are quite many (vectorize) test cases affected by interchange
(which is correct I believe), so I will prepare another patch enabling
it at O3 and adjusting the tests, to keep this patch small.

>
> diff --git a/gcc/tree-ssa-loop-interchange.cc 
> b/gcc/tree-ssa-loop-interchange.cc
>
> gimple-loop-interchange.cc please.
>
> new file mode 100644
> index 000..abffbf6
> --- /dev/null
> +++ b/gcc/tree-ssa-loop-interchange.cc
> @@ -0,0 +1,2274 @@
> +/* Loop invariant motion.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
>
> Loop invariant motion? ... ;)
>
> Please add a "Contributed by ..." to have an easy way to figure people to 
> blame.
>
> +}*induction_p;
> +
>
> space after '*'
>
> +}*reduction_p;
> +
>
> likewise.
>
> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
> +   ssa var appearing in any abnormal phi node.  */
> +
> +static inline bool
> +unsupported_phi_node (gphi *phi)
> +{
> +  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
> +return true;
> +
> +  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> +{
> +  tree arg = PHI_ARG_DEF (phi, i);
> +  if (TREE_CODE (arg) == SSA_NAME
> + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
> +   return true;
> +}
> +
> +  return false;
>
> I believe the above isn't necessary given you rule out abnormal edges
> into the loop.
> Did you have a testcase that broke?  A minor thing I guess if it is
> just for extra
> safety...
>
> +/* Return true if all stmts in BB can be supported by loop interchange,
> +   otherwise return false.  ILOOP is not NULL if this loop_cand is the
> +   outer loop in loop nest.  */
> +
> +bool
> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
> +{
>
> docs and return value suggest this be named supported_operation
>
> +  /* Or it's invariant memory reference and only used by inner loop.  */
> +  if (gimple_assign_single_p (stmt)
> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> + && TREE_CODE (lhs) == SSA_NAME
> + && single_use_in_loop (lhs, iloop->loop))
> +   continue;
>
> comment suggests multiple uses in loop would be ok?
>
> +  if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
> + || lhs != re->init)
> +   return;
> +
> +  if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
> + || !REFERENCE_CLASS_P (rhs))
> +   return;
>
> lhs and rhs are never NULL.  Please initialize them outside of the if.
> You want to disallow DECL_P rhs here?
>
> Can you add an overall function comment what this function does?  It seems
> to detect a reduction as produced by loop store-motion?  Thus it tries to
> get at enough information to perform the reverse transform?
>
> During review I have a hard time distinguishing between locals and members
> so can you please prefix all member variables with m_ as according to our
> code guidelines?  I guess what adds to the confusion is the loop_cand argument
> that sometimes is present for loop_cand member functions...
> 

Re: [PATCH GCC]A simple implementation of loop interchange

2017-11-20 Thread Richard Biener
On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng  wrote:
> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz  wrote:
>> Hello,
>>
>> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>>
>>> This is updated patch for loop interchange with review suggestions
>>> resolved.  Changes are:
>>>   1) It does more light weight checks like rectangle loop nest check
>>> earlier than before.
>>>   2) It checks profitability of interchange before data dependence 
>>> computation.
>>>   3) It calls find_data_references_in_loop only once for a loop nest now.
>>>   4) Data dependence is open-computed so that we can skip instantly at
>>> unknown dependence.
>>>   5) It improves code generation in mapping induction variables for
>>> loop nest, as well as
>>>  adding a simple dead code elimination pass.
>>>   6) It changes magic constants into parameters.
>>
>> So I have a couple comments/questions.  Something stylistic:
> Hi Michael,
> Thanks for reviewing.
>
>>
>>> +class loop_cand
>>> +{
>>> +public:
>>> ...
>>> +  friend class tree_loop_interchange;
>>> +private:
>>
>> Just make this all public (and hence a struct, not class).
>> No need for friends in file local classes.
> Done.
>
>>
>>> +single_use_in_loop (tree var, struct loop *loop)
>>> ...
>>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>>> +{
>>> +  stmt = USE_STMT (use_p);
>>> ...
>>> +  basic_block bb = gimple_bb (stmt);
>>> +  gcc_assert (bb != NULL);
>>
>> This pattern reoccurs often in your patch: you check for a bb associated
>> for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need
>> for checking.
> Done.
>
>>
>> Then, something about your handling of simple reductions:
>>
>>> +void
>>> +loop_cand::classify_simple_reduction (reduction_p re)
>>> +{
>>> ...
>>> +  /* Require memory references in producer and consumer are the same so
>>> + that we can undo reduction during interchange.  */
>>> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>>> +return;
>>
>> Where is it checked that the undoing transformation is legal also
>> from a data dep point of view?  Think code like this:
>>
>>sum = X[i];
>>for (j ...)
>>  sum += X[j];
>>X[i] = sum;
>>
>> Moving the store into the inner loop isn't always correct and I don't seem
>> to find where the above situation is rejected.
> Yeah.  for the old patch, it's possible to have such loop wrongly 
> interchanged;
> in practice, it's hard to create an example.  The pass will give up
> when computing
> data dep between references in inner/outer loops.  In this updated
> patch, it's fixed
> by giving up if there is any dependence between references of inner/outer 
> loops.
>
>>
>> Maybe I'm confused because I also don't see where you even can get into
>> the above situation (though I do see testcases about this).  The thing is,
>> for an 2d loop nest to contain something like the above reduction it can't
>> be perfect:
>>
>>for (j) {
>>  int sum = X[j];  // 1
>>  for (i)
>>sum += Y[j][i];
>>  X[j] = sum;  // 2
>>}
>>
>> But you do check for perfectness in proper_loop_form_for_interchange and
>> prepare_perfect_loop_nest, so either you can't get into the situation or
>> the checking can't be complete, or you define the above to be perfect
>> nevertheless (probably because the load and store are in outer loop
>> header/exit blocks?).  The latter would mean that you accept also other
>> code in header/footer of loops from a pure CFG perspective, so where is it
>> checked that that other code (which aren't simple reductions) isn't
>> harmful to the transformation?
> Yes, I used the name perfect loop nest, but the pass can handle special form
> imperfect loop nest for the simple reduction.  I added comments describing
> this before function prepare_perfect_loop_nest.
>
>>
>> Then, the data dependence part of the new pass:
>>
>>> +bool
>>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned 
>>> outer)
>>> +{
>>> +  struct data_dependence_relation *ddr;
>>> +
>>> +  for (unsigned i = 0; ddrs.iterate (i, ); ++i)
>>> +{
>>> +  /* Skip no-dependence case.  */
>>> +  if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>> + continue;
>>> +
>>> +  for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>> + {
>>> +   lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>>> +   unsigned level = dependence_level (dist_vect, loop_nest.length ());
>>> +
>>> +   /* If there is no carried dependence.  */
>>> +   if (level == 0)
>>> + continue;
>>> +
>>> +   level --;
>>> +   /* Skip case which has '>' as the leftmost direction.  */
>>> +   if (!lambda_vector_lexico_pos (dist_vect, level))
>>> + return false;
>>
>> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.
> Done.
>
>>
>>> +   /* If dependence is carried by outer loop of the two loops for
>>> +  interchange.  */
>>> +   if 

Re: [PATCH GCC]A simple implementation of loop interchange

2017-11-16 Thread Bin.Cheng
On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz  wrote:
> Hello,
>
> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>
>> This is updated patch for loop interchange with review suggestions
>> resolved.  Changes are:
>>   1) It does more light weight checks like rectangle loop nest check
>> earlier than before.
>>   2) It checks profitability of interchange before data dependence 
>> computation.
>>   3) It calls find_data_references_in_loop only once for a loop nest now.
>>   4) Data dependence is open-computed so that we can skip instantly at
>> unknown dependence.
>>   5) It improves code generation in mapping induction variables for
>> loop nest, as well as
>>  adding a simple dead code elimination pass.
>>   6) It changes magic constants into parameters.
>
> So I have a couple comments/questions.  Something stylistic:
Hi Michael,
Thanks for reviewing.

>
>> +class loop_cand
>> +{
>> +public:
>> ...
>> +  friend class tree_loop_interchange;
>> +private:
>
> Just make this all public (and hence a struct, not class).
> No need for friends in file local classes.
Done.

>
>> +single_use_in_loop (tree var, struct loop *loop)
>> ...
>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>> +{
>> +  stmt = USE_STMT (use_p);
>> ...
>> +  basic_block bb = gimple_bb (stmt);
>> +  gcc_assert (bb != NULL);
>
> This pattern reoccurs often in your patch: you check for a bb associated
> for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need
> for checking.
Done.

>
> Then, something about your handling of simple reductions:
>
>> +void
>> +loop_cand::classify_simple_reduction (reduction_p re)
>> +{
>> ...
>> +  /* Require memory references in producer and consumer are the same so
>> + that we can undo reduction during interchange.  */
>> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>> +return;
>
> Where is it checked that the undoing transformation is legal also
> from a data dep point of view?  Think code like this:
>
>sum = X[i];
>for (j ...)
>  sum += X[j];
>X[i] = sum;
>
> Moving the store into the inner loop isn't always correct and I don't seem
> to find where the above situation is rejected.
Yeah.  for the old patch, it's possible to have such loop wrongly interchanged;
in practice, it's hard to create an example.  The pass will give up
when computing
data dep between references in inner/outer loops.  In this updated
patch, it's fixed
by giving up if there is any dependence between references of inner/outer loops.

>
> Maybe I'm confused because I also don't see where you even can get into
> the above situation (though I do see testcases about this).  The thing is,
> for an 2d loop nest to contain something like the above reduction it can't
> be perfect:
>
>for (j) {
>  int sum = X[j];  // 1
>  for (i)
>sum += Y[j][i];
>  X[j] = sum;  // 2
>}
>
> But you do check for perfectness in proper_loop_form_for_interchange and
> prepare_perfect_loop_nest, so either you can't get into the situation or
> the checking can't be complete, or you define the above to be perfect
> nevertheless (probably because the load and store are in outer loop
> header/exit blocks?).  The latter would mean that you accept also other
> code in header/footer of loops from a pure CFG perspective, so where is it
> checked that that other code (which aren't simple reductions) isn't
> harmful to the transformation?
Yes, I used the name perfect loop nest, but the pass can handle special form
imperfect loop nest for the simple reduction.  I added comments describing
this before function prepare_perfect_loop_nest.

>
> Then, the data dependence part of the new pass:
>
>> +bool
>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned 
>> outer)
>> +{
>> +  struct data_dependence_relation *ddr;
>> +
>> +  for (unsigned i = 0; ddrs.iterate (i, ); ++i)
>> +{
>> +  /* Skip no-dependence case.  */
>> +  if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>> + continue;
>> +
>> +  for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>> + {
>> +   lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>> +   unsigned level = dependence_level (dist_vect, loop_nest.length ());
>> +
>> +   /* If there is no carried dependence.  */
>> +   if (level == 0)
>> + continue;
>> +
>> +   level --;
>> +   /* Skip case which has '>' as the leftmost direction.  */
>> +   if (!lambda_vector_lexico_pos (dist_vect, level))
>> + return false;
>
> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.
Done.

>
>> +   /* If dependence is carried by outer loop of the two loops for
>> +  interchange.  */
>> +   if (level < outer)
>> + continue;
>> +
>> +   lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>> +   /* If directions at both inner/outer levels are the same.  */
>> +   if (dir_vect[inner] == dir_vect[outer])
>> +   

Re: [PATCH GCC]A simple implementation of loop interchange

2017-10-24 Thread Michael Matz
Hello,

On Fri, 22 Sep 2017, Bin.Cheng wrote:

> This is updated patch for loop interchange with review suggestions
> resolved.  Changes are:
>   1) It does more light weight checks like rectangle loop nest check
> earlier than before.
>   2) It checks profitability of interchange before data dependence 
> computation.
>   3) It calls find_data_references_in_loop only once for a loop nest now.
>   4) Data dependence is open-computed so that we can skip instantly at
> unknown dependence.
>   5) It improves code generation in mapping induction variables for
> loop nest, as well as
>  adding a simple dead code elimination pass.
>   6) It changes magic constants into parameters.

So I have a couple comments/questions.  Something stylistic:

> +class loop_cand
> +{
> +public:
> ...
> +  friend class tree_loop_interchange;
> +private:

Just make this all public (and hence a struct, not class).
No need for friends in file local classes.

> +single_use_in_loop (tree var, struct loop *loop)
> ...
> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
> +{
> +  stmt = USE_STMT (use_p);
> ...
> +  basic_block bb = gimple_bb (stmt);
> +  gcc_assert (bb != NULL);

This pattern reoccurs often in your patch: you check for a bb associated 
for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need 
for checking.

Then, something about your handling of simple reductions:

> +void
> +loop_cand::classify_simple_reduction (reduction_p re)
> +{
> ...
> +  /* Require memory references in producer and consumer are the same so
> + that we can undo reduction during interchange.  */
> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
> +return;

Where is it checked that the undoing transformation is legal also
from a data dep point of view?  Think code like this:

   sum = X[i];
   for (j ...)
 sum += X[j];
   X[i] = sum;

Moving the store into the inner loop isn't always correct and I don't seem 
to find where the above situation is rejected.

Maybe I'm confused because I also don't see where you even can get into 
the above situation (though I do see testcases about this).  The thing is, 
for an 2d loop nest to contain something like the above reduction it can't 
be perfect:

   for (j) {
 int sum = X[j];  // 1
 for (i)
   sum += Y[j][i];
 X[j] = sum;  // 2
   }

But you do check for perfectness in proper_loop_form_for_interchange and 
prepare_perfect_loop_nest, so either you can't get into the situation or 
the checking can't be complete, or you define the above to be perfect 
nevertheless (probably because the load and store are in outer loop 
header/exit blocks?).  The latter would mean that you accept also other 
code in header/footer of loops from a pure CFG perspective, so where is it 
checked that that other code (which aren't simple reductions) isn't 
harmful to the transformation?

Then, the data dependence part of the new pass:

> +bool
> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned 
> outer)
> +{
> +  struct data_dependence_relation *ddr;
> +
> +  for (unsigned i = 0; ddrs.iterate (i, ); ++i)
> +{
> +  /* Skip no-dependence case.  */
> +  if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> + continue;
> +
> +  for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
> + {
> +   lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
> +   unsigned level = dependence_level (dist_vect, loop_nest.length ());
> +
> +   /* If there is no carried dependence.  */
> +   if (level == 0)
> + continue;
> +
> +   level --;
> +   /* Skip case which has '>' as the leftmost direction.  */
> +   if (!lambda_vector_lexico_pos (dist_vect, level))
> + return false;

Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.

> +   /* If dependence is carried by outer loop of the two loops for
> +  interchange.  */
> +   if (level < outer)
> + continue;
> +
> +   lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
> +   /* If directions at both inner/outer levels are the same.  */
> +   if (dir_vect[inner] == dir_vect[outer])
> + continue;
> +
> +   /* Be conservative, skip case if either direction at inner/outer
> +  levels is not '=' or '<'.  */
> +   if (dir_vect[inner] != dir_equal
> +   && dir_vect[inner] != dir_positive
> +   && dir_vect[inner] != dir_independent
> +   && dir_vect[inner] != dir_positive_or_equal)
> + return false;
> +
> +   if (dir_vect[outer] != dir_equal
> +   && dir_vect[outer] != dir_positive
> +   && dir_vect[outer] != dir_independent
> +   && dir_vect[outer] != dir_positive_or_equal)
> + return false;

Checking dir vectors doesn't make much sense in GCC: the elements are only
ever set to dir_positive, dir_negative or dir_equal, exactly when distance 
is
 > 0, < 0 or == 0.  So checking dist vector is 

Re: [PATCH GCC]A simple implementation of loop interchange

2017-09-22 Thread Bin.Cheng
On Mon, Sep 4, 2017 at 2:54 PM, Richard Biener
 wrote:
> On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng  wrote:
>> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>>  wrote:
>>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng  wrote:
 Hi,
 This patch implements a simple loop interchange pass in GCC, as described 
 by its comments:
 +/* This pass performs loop interchange: for example, the loop nest
 +
 +   for (int j = 0; j < N; j++)
 + for (int k = 0; k < N; k++)
 +   for (int i = 0; i < N; i++)
 +c[i][j] = c[i][j] + a[i][k]*b[k][j];
 +
 +   is transformed to
 +
 +   for (int i = 0; i < N; i++)
 + for (int j = 0; j < N; j++)
 +   for (int k = 0; k < N; k++)
 +c[i][j] = c[i][j] + a[i][k]*b[k][j];
 +
 +   This pass implements loop interchange in the following steps:
 +
 + 1) Find perfect loop nest for each innermost loop and compute data
 +   dependence relations for it.  For above example, loop nest is
 +   .
 + 2) From innermost to outermost loop, this pass tries to interchange
 +   each loop pair.  For above case, it firstly tries to interchange
 +    and loop nest becomes .
 +   Then it tries to interchange  and loop nest becomes
 +   .  The overall effect is to move innermost
 +   loop to the outermost position.  For loop pair 
 +   to be interchanged, we:
 + 3) Check if data dependence relations are valid for loop interchange.
 + 4) Check if both loops can be interchanged in terms of 
 transformation.
 + 5) Check if interchanging the two loops is profitable.
 + 6) Interchange the two loops by mapping induction variables.
 +
 +   This pass also handles reductions in loop nest.  So far we only support
 +   simple reduction of inner loop and double reduction of the loop nest.  
 */

 Actually, this pass only does loop shift which outermosting inner loop to 
 outer, rather
 than permutation.  Also as a traditional loop optimizer, it only works for 
 perfect loop
 nest.  I put it just after loop distribution thus ideally loop 
 split/distribution could
 create perfect nest for it.  Unfortunately, we don't get any perfect nest 
 from distribution
 for now because it only works for innermost loop.  For example, the 
 motivation case in
 spec2k6/bwaves is not handled on this pass alone.  I have a patch 
 extending distribution
 for (innermost) loop nest and with that patch bwaves case can be handled.
 Another point is I deliberately make both the cost model and code 
 transformation (very)
 conservative.  We can support more cases, or more transformations with 
 great care when
 it is for sure known beneficial.  IMHO, we already hit over-baked issues 
 quite often and
 don't want to introduce more.
 As for code generation, this patch has an issue that invariant code in 
 outer loop could
 be moved to inner loop.  For the moment, we rely on the last lim pass to 
 handle all INV
 generated during interchange.  In the future, we may need to avoid that in 
 interchange
 itself, or another lim pass just like the one after graphite optimizations.

 Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run 
 successfully.
 Note this pass is disabled in patch, while the code is exercised by 
 bootstrap/building
 programs with it enabled by default.  Any comments?
>>>
>> Thanks for quick review.
>>> +/* The same as above, but this one is only used for interchanging not
>>> +   innermost loops.  */
>>> +#define OUTER_STRIDE_RATIO (2)
>>>
>>> please make all these knobs --params.
>>>
>>> +/* Enum type for loop reduction variable.  */
>>> +enum reduction_type
>>> +{
>>> +  UNKNOWN_RTYPE = 0,
>>> +  SIMPLE_RTYPE,
>>> +  DOUBLE_RTYPE
>>> +};
>>>
>>> seeing this we should have some generic data structure / analysis for
>>> reduction detection.  This adds a third user (autopar and vectorizer
>>> are the others).  Just an idea.
>>>
>>> +/* Return true if E is abnormal edge.  */
>>> +
>>> +static inline bool
>>> +abnormal_edge (edge e)
>>> +{
>>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>>> +}
>>>
>>> bad name/comment for what it does.
>>>
>>> ... jumping to end of file / start of pass
>>>
>>> +  /* Get the outer loop.  */
>>> +  loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>>
>>> loop_outer (loop)?
>>>
>>> +  /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>>> +depends on outer loop's IV.  */
>>> +  if 

Re: [PATCH GCC]A simple implementation of loop interchange

2017-09-04 Thread Bin.Cheng
On Mon, Sep 4, 2017 at 2:54 PM, Richard Biener
 wrote:
> On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng  wrote:
>> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>>  wrote:
>>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng  wrote:
 Hi,
 This patch implements a simple loop interchange pass in GCC, as described 
 by its comments:
 +/* This pass performs loop interchange: for example, the loop nest
 +
 +   for (int j = 0; j < N; j++)
 + for (int k = 0; k < N; k++)
 +   for (int i = 0; i < N; i++)
 +c[i][j] = c[i][j] + a[i][k]*b[k][j];
 +
 +   is transformed to
 +
 +   for (int i = 0; i < N; i++)
 + for (int j = 0; j < N; j++)
 +   for (int k = 0; k < N; k++)
 +c[i][j] = c[i][j] + a[i][k]*b[k][j];
 +
 +   This pass implements loop interchange in the following steps:
 +
 + 1) Find perfect loop nest for each innermost loop and compute data
 +   dependence relations for it.  For above example, loop nest is
 +   .
 + 2) From innermost to outermost loop, this pass tries to interchange
 +   each loop pair.  For above case, it firstly tries to interchange
 +    and loop nest becomes .
 +   Then it tries to interchange  and loop nest becomes
 +   .  The overall effect is to move innermost
 +   loop to the outermost position.  For loop pair 
 +   to be interchanged, we:
 + 3) Check if data dependence relations are valid for loop interchange.
 + 4) Check if both loops can be interchanged in terms of 
 transformation.
 + 5) Check if interchanging the two loops is profitable.
 + 6) Interchange the two loops by mapping induction variables.
 +
 +   This pass also handles reductions in loop nest.  So far we only support
 +   simple reduction of inner loop and double reduction of the loop nest.  
 */

 Actually, this pass only does loop shift which outermosting inner loop to 
 outer, rather
 than permutation.  Also as a traditional loop optimizer, it only works for 
 perfect loop
 nest.  I put it just after loop distribution thus ideally loop 
 split/distribution could
 create perfect nest for it.  Unfortunately, we don't get any perfect nest 
 from distribution
 for now because it only works for innermost loop.  For example, the 
 motivation case in
 spec2k6/bwaves is not handled on this pass alone.  I have a patch 
 extending distribution
 for (innermost) loop nest and with that patch bwaves case can be handled.
 Another point is I deliberately make both the cost model and code 
 transformation (very)
 conservative.  We can support more cases, or more transformations with 
 great care when
 it is for sure known beneficial.  IMHO, we already hit over-baked issues 
 quite often and
 don't want to introduce more.
 As for code generation, this patch has an issue that invariant code in 
 outer loop could
 be moved to inner loop.  For the moment, we rely on the last lim pass to 
 handle all INV
 generated during interchange.  In the future, we may need to avoid that in 
 interchange
 itself, or another lim pass just like the one after graphite optimizations.

 Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run 
 successfully.
 Note this pass is disabled in patch, while the code is exercised by 
 bootstrap/building
 programs with it enabled by default.  Any comments?
>>>
>> Thanks for quick review.
>>> +/* The same as above, but this one is only used for interchanging not
>>> +   innermost loops.  */
>>> +#define OUTER_STRIDE_RATIO (2)
>>>
>>> please make all these knobs --params.
>>>
>>> +/* Enum type for loop reduction variable.  */
>>> +enum reduction_type
>>> +{
>>> +  UNKNOWN_RTYPE = 0,
>>> +  SIMPLE_RTYPE,
>>> +  DOUBLE_RTYPE
>>> +};
>>>
>>> seeing this we should have some generic data structure / analysis for
>>> reduction detection.  This adds a third user (autopar and vectorizer
>>> are the others).  Just an idea.
>>>
>>> +/* Return true if E is abnormal edge.  */
>>> +
>>> +static inline bool
>>> +abnormal_edge (edge e)
>>> +{
>>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>>> +}
>>>
>>> bad name/comment for what it does.
>>>
>>> ... jumping to end of file / start of pass
>>>
>>> +  /* Get the outer loop.  */
>>> +  loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>>
>>> loop_outer (loop)?
>>>
>>> +  /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>>> +depends on outer loop's IV.  */
>>> +  if 

Re: [PATCH GCC]A simple implementation of loop interchange

2017-09-04 Thread Richard Biener
On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng  wrote:
> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>  wrote:
>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng  wrote:
>>> Hi,
>>> This patch implements a simple loop interchange pass in GCC, as described 
>>> by its comments:
>>> +/* This pass performs loop interchange: for example, the loop nest
>>> +
>>> +   for (int j = 0; j < N; j++)
>>> + for (int k = 0; k < N; k++)
>>> +   for (int i = 0; i < N; i++)
>>> +c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> +   is transformed to
>>> +
>>> +   for (int i = 0; i < N; i++)
>>> + for (int j = 0; j < N; j++)
>>> +   for (int k = 0; k < N; k++)
>>> +c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> +   This pass implements loop interchange in the following steps:
>>> +
>>> + 1) Find perfect loop nest for each innermost loop and compute data
>>> +   dependence relations for it.  For above example, loop nest is
>>> +   .
>>> + 2) From innermost to outermost loop, this pass tries to interchange
>>> +   each loop pair.  For above case, it firstly tries to interchange
>>> +    and loop nest becomes .
>>> +   Then it tries to interchange  and loop nest becomes
>>> +   .  The overall effect is to move innermost
>>> +   loop to the outermost position.  For loop pair 
>>> +   to be interchanged, we:
>>> + 3) Check if data dependence relations are valid for loop interchange.
>>> + 4) Check if both loops can be interchanged in terms of transformation.
>>> + 5) Check if interchanging the two loops is profitable.
>>> + 6) Interchange the two loops by mapping induction variables.
>>> +
>>> +   This pass also handles reductions in loop nest.  So far we only support
>>> +   simple reduction of inner loop and double reduction of the loop nest.  
>>> */
>>>
>>> Actually, this pass only does loop shift which outermosting inner loop to 
>>> outer, rather
>>> than permutation.  Also as a traditional loop optimizer, it only works for 
>>> perfect loop
>>> nest.  I put it just after loop distribution thus ideally loop 
>>> split/distribution could
>>> create perfect nest for it.  Unfortunately, we don't get any perfect nest 
>>> from distribution
>>> for now because it only works for innermost loop.  For example, the 
>>> motivation case in
>>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending 
>>> distribution
>>> for (innermost) loop nest and with that patch bwaves case can be handled.
>>> Another point is I deliberately make both the cost model and code 
>>> transformation (very)
>>> conservative.  We can support more cases, or more transformations with 
>>> great care when
>>> it is for sure known beneficial.  IMHO, we already hit over-baked issues 
>>> quite often and
>>> don't want to introduce more.
>>> As for code generation, this patch has an issue that invariant code in 
>>> outer loop could
>>> be moved to inner loop.  For the moment, we rely on the last lim pass to 
>>> handle all INV
>>> generated during interchange.  In the future, we may need to avoid that in 
>>> interchange
>>> itself, or another lim pass just like the one after graphite optimizations.
>>>
>>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run 
>>> successfully.
>>> Note this pass is disabled in patch, while the code is exercised by 
>>> bootstrap/building
>>> programs with it enabled by default.  Any comments?
>>
> Thanks for quick review.
>> +/* The same as above, but this one is only used for interchanging not
>> +   innermost loops.  */
>> +#define OUTER_STRIDE_RATIO (2)
>>
>> please make all these knobs --params.
>>
>> +/* Enum type for loop reduction variable.  */
>> +enum reduction_type
>> +{
>> +  UNKNOWN_RTYPE = 0,
>> +  SIMPLE_RTYPE,
>> +  DOUBLE_RTYPE
>> +};
>>
>> seeing this we should have some generic data structure / analysis for
>> reduction detection.  This adds a third user (autopar and vectorizer
>> are the others).  Just an idea.
>>
>> +/* Return true if E is abnormal edge.  */
>> +
>> +static inline bool
>> +abnormal_edge (edge e)
>> +{
>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>> +}
>>
>> bad name/comment for what it does.
>>
>> ... jumping to end of file / start of pass
>>
>> +  /* Get the outer loop.  */
>> +  loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>
>> loop_outer (loop)?
>>
>> +  /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>> +depends on outer loop's IV.  */
>> +  if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
>> +   break;
>>
>> but you don't check for a three-nest whether niters depends on outer outer
>> loop's IV that way.  Either the check is superfluous here or incomplete.
> It is 

Re: [PATCH GCC]A simple implementation of loop interchange

2017-08-30 Thread Bin.Cheng
On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
 wrote:
> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng  wrote:
>> Hi,
>> This patch implements a simple loop interchange pass in GCC, as described by 
>> its comments:
>> +/* This pass performs loop interchange: for example, the loop nest
>> +
>> +   for (int j = 0; j < N; j++)
>> + for (int k = 0; k < N; k++)
>> +   for (int i = 0; i < N; i++)
>> +c[i][j] = c[i][j] + a[i][k]*b[k][j];
>> +
>> +   is transformed to
>> +
>> +   for (int i = 0; i < N; i++)
>> + for (int j = 0; j < N; j++)
>> +   for (int k = 0; k < N; k++)
>> +c[i][j] = c[i][j] + a[i][k]*b[k][j];
>> +
>> +   This pass implements loop interchange in the following steps:
>> +
>> + 1) Find perfect loop nest for each innermost loop and compute data
>> +   dependence relations for it.  For above example, loop nest is
>> +   .
>> + 2) From innermost to outermost loop, this pass tries to interchange
>> +   each loop pair.  For above case, it firstly tries to interchange
>> +    and loop nest becomes .
>> +   Then it tries to interchange  and loop nest becomes
>> +   .  The overall effect is to move innermost
>> +   loop to the outermost position.  For loop pair 
>> +   to be interchanged, we:
>> + 3) Check if data dependence relations are valid for loop interchange.
>> + 4) Check if both loops can be interchanged in terms of transformation.
>> + 5) Check if interchanging the two loops is profitable.
>> + 6) Interchange the two loops by mapping induction variables.
>> +
>> +   This pass also handles reductions in loop nest.  So far we only support
>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>
>> Actually, this pass only does loop shift which outermosting inner loop to 
>> outer, rather
>> than permutation.  Also as a traditional loop optimizer, it only works for 
>> perfect loop
>> nest.  I put it just after loop distribution thus ideally loop 
>> split/distribution could
>> create perfect nest for it.  Unfortunately, we don't get any perfect nest 
>> from distribution
>> for now because it only works for innermost loop.  For example, the 
>> motivation case in
>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending 
>> distribution
>> for (innermost) loop nest and with that patch bwaves case can be handled.
>> Another point is I deliberately make both the cost model and code 
>> transformation (very)
>> conservative.  We can support more cases, or more transformations with great 
>> care when
>> it is for sure known beneficial.  IMHO, we already hit over-baked issues 
>> quite often and
>> don't want to introduce more.
>> As for code generation, this patch has an issue that invariant code in outer 
>> loop could
>> be moved to inner loop.  For the moment, we rely on the last lim pass to 
>> handle all INV
>> generated during interchange.  In the future, we may need to avoid that in 
>> interchange
>> itself, or another lim pass just like the one after graphite optimizations.
>>
>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run 
>> successfully.
>> Note this pass is disabled in patch, while the code is exercised by 
>> bootstrap/building
>> programs with it enabled by default.  Any comments?
>
Thanks for quick review.
> +/* The same as above, but this one is only used for interchanging not
> +   innermost loops.  */
> +#define OUTER_STRIDE_RATIO (2)
>
> please make all these knobs --params.
>
> +/* Enum type for loop reduction variable.  */
> +enum reduction_type
> +{
> +  UNKNOWN_RTYPE = 0,
> +  SIMPLE_RTYPE,
> +  DOUBLE_RTYPE
> +};
>
> seeing this we should have some generic data structure / analysis for
> reduction detection.  This adds a third user (autopar and vectorizer
> are the others).  Just an idea.
>
> +/* Return true if E is abnormal edge.  */
> +
> +static inline bool
> +abnormal_edge (edge e)
> +{
> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
> +}
>
> bad name/comment for what it does.
>
> ... jumping to end of file / start of pass
>
> +  /* Get the outer loop.  */
> +  loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>
> loop_outer (loop)?
>
> +  /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
> +depends on outer loop's IV.  */
> +  if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
> +   break;
>
> but you don't check for a three-nest whether niters depends on outer outer
> loop's IV that way.  Either the check is superfluous here or incomplete.
It is checked for multi-nest case in can_interchange_loops.  I will
move the check to this function so that we can save compilation time.
>
> +  /* Check if start_loop forms a perfect loop nest.  */
> +  

Re: [PATCH GCC]A simple implementation of loop interchange

2017-08-30 Thread Richard Biener
On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng  wrote:
> Hi,
> This patch implements a simple loop interchange pass in GCC, as described by 
> its comments:
> +/* This pass performs loop interchange: for example, the loop nest
> +
> +   for (int j = 0; j < N; j++)
> + for (int k = 0; k < N; k++)
> +   for (int i = 0; i < N; i++)
> +c[i][j] = c[i][j] + a[i][k]*b[k][j];
> +
> +   is transformed to
> +
> +   for (int i = 0; i < N; i++)
> + for (int j = 0; j < N; j++)
> +   for (int k = 0; k < N; k++)
> +c[i][j] = c[i][j] + a[i][k]*b[k][j];
> +
> +   This pass implements loop interchange in the following steps:
> +
> + 1) Find perfect loop nest for each innermost loop and compute data
> +   dependence relations for it.  For above example, loop nest is
> +   .
> + 2) From innermost to outermost loop, this pass tries to interchange
> +   each loop pair.  For above case, it firstly tries to interchange
> +    and loop nest becomes .
> +   Then it tries to interchange  and loop nest becomes
> +   .  The overall effect is to move innermost
> +   loop to the outermost position.  For loop pair 
> +   to be interchanged, we:
> + 3) Check if data dependence relations are valid for loop interchange.
> + 4) Check if both loops can be interchanged in terms of transformation.
> + 5) Check if interchanging the two loops is profitable.
> + 6) Interchange the two loops by mapping induction variables.
> +
> +   This pass also handles reductions in loop nest.  So far we only support
> +   simple reduction of inner loop and double reduction of the loop nest.  */
>
> Actually, this pass only does loop shift which outermosting inner loop to 
> outer, rather
> than permutation.  Also as a traditional loop optimizer, it only works for 
> perfect loop
> nest.  I put it just after loop distribution thus ideally loop 
> split/distribution could
> create perfect nest for it.  Unfortunately, we don't get any perfect nest 
> from distribution
> for now because it only works for innermost loop.  For example, the 
> motivation case in
> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending 
> distribution
> for (innermost) loop nest and with that patch bwaves case can be handled.
> Another point is I deliberately make both the cost model and code 
> transformation (very)
> conservative.  We can support more cases, or more transformations with great 
> care when
> it is for sure known beneficial.  IMHO, we already hit over-baked issues 
> quite often and
> don't want to introduce more.
> As for code generation, this patch has an issue that invariant code in outer 
> loop could
> be moved to inner loop.  For the moment, we rely on the last lim pass to 
> handle all INV
> generated during interchange.  In the future, we may need to avoid that in 
> interchange
> itself, or another lim pass just like the one after graphite optimizations.
>
> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run 
> successfully.
> Note this pass is disabled in patch, while the code is exercised by 
> bootstrap/building
> programs with it enabled by default.  Any comments?

+/* The same as above, but this one is only used for interchanging not
+   innermost loops.  */
+#define OUTER_STRIDE_RATIO (2)

please make all these knobs --params.

+/* Enum type for loop reduction variable.  */
+enum reduction_type
+{
+  UNKNOWN_RTYPE = 0,
+  SIMPLE_RTYPE,
+  DOUBLE_RTYPE
+};

seeing this we should have some generic data structure / analysis for
reduction detection.  This adds a third user (autopar and vectorizer
are the others).  Just an idea.

+/* Return true if E is abnormal edge.  */
+
+static inline bool
+abnormal_edge (edge e)
+{
+  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
+}

bad name/comment for what it does.

... jumping to end of file / start of pass

+  /* Get the outer loop.  */
+  loop = superloop_at_depth (loop, loop_depth (loop) - 1);

loop_outer (loop)?

+  /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
+depends on outer loop's IV.  */
+  if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
+   break;

but you don't check for a three-nest whether niters depends on outer outer
loop's IV that way.  Either the check is superfluous here or incomplete.

+  /* Check if start_loop forms a perfect loop nest.  */
+  loop_nest->create (3);
+  do {
+datarefs->create (20);
+ddrs->create (20);
+loop_nest->truncate (0);
+if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
+  datarefs, ddrs))
+  {
+   unsigned i;
+   struct data_dependence_relation *ddr;
+
+   for (i = 0; ddrs->iterate (i, ); ++i)
+ {
+   if 

[PATCH GCC]A simple implementation of loop interchange

2017-08-30 Thread Bin Cheng
Hi,
This patch implements a simple loop interchange pass in GCC, as described by 
its comments:
+/* This pass performs loop interchange: for example, the loop nest
+
+   for (int j = 0; j < N; j++)
+ for (int k = 0; k < N; k++)
+   for (int i = 0; i < N; i++)
+c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   is transformed to
+
+   for (int i = 0; i < N; i++)
+ for (int j = 0; j < N; j++)
+   for (int k = 0; k < N; k++)
+c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   This pass implements loop interchange in the following steps:
+
+ 1) Find perfect loop nest for each innermost loop and compute data
+   dependence relations for it.  For above example, loop nest is
+   .
+ 2) From innermost to outermost loop, this pass tries to interchange
+   each loop pair.  For above case, it firstly tries to interchange
+    and loop nest becomes .
+   Then it tries to interchange  and loop nest becomes
+   .  The overall effect is to move innermost
+   loop to the outermost position.  For loop pair 
+   to be interchanged, we:
+ 3) Check if data dependence relations are valid for loop interchange.
+ 4) Check if both loops can be interchanged in terms of transformation.
+ 5) Check if interchanging the two loops is profitable.
+ 6) Interchange the two loops by mapping induction variables.
+
+   This pass also handles reductions in loop nest.  So far we only support
+   simple reduction of inner loop and double reduction of the loop nest.  */

Actually, this pass only does loop shift which outermosting inner loop to 
outer, rather
than permutation.  Also as a traditional loop optimizer, it only works for 
perfect loop
nest.  I put it just after loop distribution thus ideally loop 
split/distribution could
create perfect nest for it.  Unfortunately, we don't get any perfect nest from 
distribution
for now because it only works for innermost loop.  For example, the motivation 
case in
spec2k6/bwaves is not handled on this pass alone.  I have a patch extending 
distribution
for (innermost) loop nest and with that patch bwaves case can be handled.
Another point is I deliberately make both the cost model and code 
transformation (very)
conservative.  We can support more cases, or more transformations with great 
care when
it is for sure known beneficial.  IMHO, we already hit over-baked issues quite 
often and
don't want to introduce more.
As for code generation, this patch has an issue that invariant code in outer 
loop could
be moved to inner loop.  For the moment, we rely on the last lim pass to handle 
all INV
generated during interchange.  In the future, we may need to avoid that in 
interchange
itself, or another lim pass just like the one after graphite optimizations.

Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run 
successfully.
Note this pass is disabled in patch, while the code is exercised by 
bootstrap/building
programs with it enabled by default.  Any comments?

Thanks,
bin
2017-08-29  Bin Cheng  

* Makefile.in (tree-ssa-loop-interchange.o): New object file.
* common.opt (ftree-loop-interchange): New option.
* doc/invoke.texi (-ftree-loop-interchange): Document new option.
* passes.def (pass_linterchange): New pass.
* timevar.def (TV_LINTERCHANGE): New time var.
* tree-pass.h (make_pass_linterchange): New declaration.
* tree-ssa-loop-interchange.cc: New file.
* tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
* tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.

gcc/testsuite
2017-08-29  Bin Cheng  

* gcc.dg/tree-ssa/loop-interchange-1.c: New test.
* gcc.dg/tree-ssa/loop-interchange-2.c: New test.
* gcc.dg/tree-ssa/loop-interchange-3.c: New test.
* gcc.dg/tree-ssa/loop-interchange-4.c: New test.
* gcc.dg/tree-ssa/loop-interchange-5.c: New test.
* gcc.dg/tree-ssa/loop-interchange-6.c: New test.
* gcc.dg/tree-ssa/loop-interchange-7.c: New test.
* gcc.dg/tree-ssa/loop-interchange-8.c: New test.
* gcc.dg/tree-ssa/loop-interchange-9.c: New test.
* gcc.dg/tree-ssa/loop-interchange-10.c: New test.diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 0bde7ac..5002598 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1522,6 +1522,7 @@ OBJS = \
tree-ssa-live.o \
tree-ssa-loop-ch.o \
tree-ssa-loop-im.o \
+   tree-ssa-loop-interchange.o \
tree-ssa-loop-ivcanon.o \
tree-ssa-loop-ivopts.o \
tree-ssa-loop-manip.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 1331008..e7efa09 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2524,6 +2524,10 @@ ftree-loop-distribute-patterns
 Common Report