Re: [GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-13 Thread Mircea Namolaru
A small modification of the patch + extended comment for the
mapping used for computing the separating class. The code didn't
contained an adjustment of the stride with 1 (but the comment yes),
that was temporary removed and forget to insert it back.

The code generated now is:

ISL AST generated by ISL: 
{
  for (int c0 = 0; c0  HEIGHT - 3; c0 += 4)
for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
  for (int c2 = c0; c2 = c0 + 3; c2 += 1)
S_4(c2, c1);
  if ((HEIGHT - 1) % 4 = 2)
for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
  for (int c2 = -((HEIGHT - 1) % 4) + HEIGHT - 1; c2  HEIGHT; c2 += 1)
S_4(c2, c1);
}

The previous code was correct too (after all the code set an option for AST
but the transformation is preserved), but slightly less optimal in the case
when N was a multiple of the stride.

Mircea





- Original Message -
 From: Sven Verdoolaege sk...@kotnet.org
 To: Mircea Namolaru mircea.namol...@inria.fr
 Cc: gcc-patches@gcc.gnu.org, Tobias Grosser tob...@grosser.es, Richard 
 Biener richard.guent...@gmail.com,
 Albert Cohen albert.co...@inria.fr
 Sent: Wednesday, November 12, 2014 12:24:40 AM
 Subject: Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
 
 On Tue, Nov 11, 2014 at 04:05:57PM +0100, Mircea Namolaru wrote:
   I'm not sure if Tobi or Albert have told you, but the separation_class
   option
   is going to be phased out since its design is fundamentally flawed.
   If you can't wait until isl-0.15, then I guess you have no choice but
   to use this option, but you should realize that it will remain frozen
   in its current broken state (until it is removed at some point).
   
  
  No, didn't know about the phase out of separation_class option. Anyway, for
  the time
  being is the best solution available. My understanding is that this option
  should
  always generate correct code, of course as long as the scheduling is
  correct, but
  think that had some cases when setting the separating_class leads to
  incorrect code.
  
  For isl_0.15, do you intend to provide some option with a similar
  functionality ?
 
 Yes.
 
+  /* Extract the original and auxiliar maps from pbb-transformed.
+ Set pbb-transformed to the original map. */
+  psmap = smap;
+  psmap-n = 0;
+  res = isl_map_foreach_basic_map (pbb-transformed, separate_map,
(void *)psmap);
+  gcc_assert (res == 0);
+
+  isl_map_free(pbb-transformed);
+  pbb-transformed = isl_map_copy(psmap-map_arr[0]);
+
   
   I have no idea what this pbb-transformed is supposed to represent,
   but you appear to be assuming that it has exactly two disjuncts and that
   they appear in some order.  Now, perhaps you have explicitly
   checked that this map has two disjuncts, but then you should
   probably bring the check closer since any operation on sets that
   you perform could change the internal representation (even of
   other sets).  However, in no way can you assume that
   isl_map_foreach_basic_map will iterate over these disjuncts
   in any specific order.
  
  
  At this point pbb-transformed has two basic maps, one is the mapping for
  unroll and jam,
  and one for the full tile for the striped dimension. Introduce a check that
  differentiate
  between them as the image of one maps should be included in the other.
 
 I didn't do a full review (and I won't have time for it either),
 but at a quick glance,
 you still seem to be assuming that if you take the union of two
 basic maps, that you can extract the original two basic maps from the union.
 In general, you can't.  At least you shouldn't assume that you can.
 Besides, your updated code is also pretty convoluted,
 with very little documentation.
 
 skimo
 
2014-11-12  Mircea Namolaru  mircea.namol...@inria.fr

* common.opt (flag_loop_unroll_and_jam): New flag for unroll and jam.
* params.def (PARAM_LOOP_UNROLL_JAM_SIZE)
(PARAM_LOOP_UNROLL_JAM_DEPTH): Parameters for unroll and jam flag.
* graphite-poly.h (struct poly_bb:map_sepclass): New field
* graphite-poly.c (new_poly_bb): New field intialization.
* graphite-isl-ast-to-gimple.c (generate_luj_sepclass_opt,
generate_luj_sepclass,generate_luj_options):
New functions to set AST options for unroll and jam.
* graphite-isl-ast-to-gimple.c (set_options,scop_to_isl_ast):
Added support for unroll and jam options.
* graphite-optimize-isl.c (getPrevectorMap_full): New function for
sepating class map.
* graphite-optimize-isl.c (optimize_isl,apply_schedule_map_to_scop,
getScheduleMap,getScheduleForBand,getScheduleForBandList):
Added support for the separating class map.
* graphite.c: Support for unroll and jam flag.
* graphite-poly.c: Likewise.
* toplev.c: Likewise.

Index: gcc/graphite-poly.h
===
--- gcc/graphite-poly.h

Re: [GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-12 Thread Mircea Namolaru

 
  At this point pbb-transformed has two basic maps, one is the mapping for
  unroll and jam,
  and one for the full tile for the striped dimension. Introduce a check that
  differentiate
  between them as the image of one maps should be included in the other.
 
 I didn't do a full review (and I won't have time for it either),
 but at a quick glance,
 you still seem to be assuming that if you take the union of two
 basic maps, that you can extract the original two basic maps from the union.
 In general, you can't.  At least you shouldn't assume that you can.
 Besides, your updated code is also pretty convoluted,
 with very little documentation.
 

Many thanks. To considerably ease the computation of separation class, needed 
to transmit information
available at scheduling computation time to AST computation time. I didn't want 
to
introduce new fields in graphite global structures, and tried to trick 
pbb-transformed
to contain the unroll and jam mapping as well as a mapping use to compute the 
separation
class. It wasn't good idea (for the mappings used by unroll and jam my version 
of isl preserve the basic mappings afteer union), but of course my code should 
not be based on
assumptions not guaranteed by isl semantics.

So, in this updated patch I go the other way and keep the additional 
information required in the
pbb structure. This has the advantages that removes the most convoluted part of 
my code dealing 
with the two basic maps kept in pbb-transformed. As before in 
graphite-optimize_isl,c, 
beside the map for unroll and jam, another map needed for the separation class 
defined by
full tiles is computed. In graphite-isl-ast-to-gimple.c, the separation class 
option is set (using 
the auxiliary map computed previously) and is added to the other AST build 
options. 

Mircea

2014-11-12  Mircea Namolaru  mircea.namol...@inria.fr

* common.opt (flag_loop_unroll_and_jam): New flag for unroll and jam.
* params.def (PARAM_LOOP_UNROLL_JAM_SIZE)
(PARAM_LOOP_UNROLL_JAM_DEPTH): Parameters for unroll and jam flag.
* graphite-poly.h (struct poly_bb:map_sepclass): New field
* graphite-poly.c (new_poly_bb): New field intialization.
* graphite-isl-ast-to-gimple.c (generate_luj_sepclass_opt,
generate_luj_sepclass,generate_luj_options):
New functions to set AST options for unroll and jam.
* graphite-isl-ast-to-gimple.c (set_options,scop_to_isl_ast):
Added support for unroll and jam options.
* graphite-optimize-isl.c (getPrevectorMap_full): New function for
sepating class map.
* graphite-optimize-isl.c (optimize_isl,apply_schedule_map_to_scop,
getScheduleMap,getScheduleForBand,getScheduleForBandList):
Added support for the separating class map.
* graphite.c: Support for unroll and jam flag.
* graphite-poly.c: Likewise.
* toplev.c: Likewise.

Index: gcc/toplev.c
===
--- gcc/toplev.c	(revision 217013)
+++ gcc/toplev.c	(working copy)
@@ -1302,11 +1302,12 @@
   || flag_loop_block
   || flag_loop_interchange
   || flag_loop_strip_mine
-  || flag_loop_parallelize_all)
+  || flag_loop_parallelize_all
+  || flag_loop_unroll_jam)
 sorry (Graphite loop optimizations cannot be used (ISL is not available) 
 	   (-fgraphite, -fgraphite-identity, -floop-block, 
 	   -floop-interchange, -floop-strip-mine, -floop-parallelize-all, 
-	   and -ftree-loop-linear));
+	   -floop-unroll-and-jam, and -ftree-loop-linear));
 #endif
 
   /* One region RA really helps to decrease the code size.  */
Index: gcc/graphite-optimize-isl.c
===
--- gcc/graphite-optimize-isl.c	(revision 217013)
+++ gcc/graphite-optimize-isl.c	(working copy)
@@ -186,7 +186,7 @@
   PartialSchedule = isl_band_get_partial_schedule (Band);
   *Dimensions = isl_band_n_member (Band);
 
-  if (DisableTiling)
+  if (DisableTiling || flag_loop_unroll_jam)
 return PartialSchedule;
 
   /* It does not make any sense to tile a band with just one dimension.  */
@@ -241,7 +241,9 @@
constant number of iterations, if the number of loop iterations at
DimToVectorize can be devided by VectorWidth. The default VectorWidth is
currently constant and not yet target specific. This function does not reason
-   about parallelism.  */
+   about parallelism.
+
+  */
 static isl_map *
 getPrevectorMap (isl_ctx *ctx, int DimToVectorize,
 		 int ScheduleDimensions,
@@ -305,8 +307,89 @@
   isl_constraint_set_constant_si (c, VectorWidth - 1);
   TilingMap = isl_map_add_constraint (TilingMap, c);
 
-  isl_map_dump (TilingMap);
+  return TilingMap;
+}
 
+/* Compute an auxiliary map to getPrevectorMap, for computing the separating 
+   class defined by full tiles.  Used in graphite_isl_ast_to_gimple.c to set the 
+   corresponding option for AST build.
+ */ 
+static isl_map *

Re: [GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-11 Thread Mircea Namolaru

Many thanks. Here is the new patch that fixes the main problem of the previous 
one
(i.e separation of the loop after unroll and jam) as well as the problems raised
by you (see comments below).

Now the code with the separation class option looks: 

ISL AST generated by ISL: 
{
  for (int c0 = 0; c0  HEIGHT - 4; c0 += 4)
for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
  for (int c2 = c0; c2 = c0 + 3; c2 += 1)
S_4(c2, c1);
  for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
for (int c2 = -((HEIGHT - 1) % 4) + HEIGHT - 1; c2  HEIGHT; c2 += 1)
  S_4(c2, c1);
}
 
I tried the unroll option for AST, two loops are unrolled 
and the code looks like:

ISL AST generated by ISL: 
{
  for (int c0 = 0; c0  HEIGHT - 4; c0 += 4)
for (int c1 = 0; c1  LENGTH - 3; c1 += 1) {
  S_4(c0, c1);
  S_4(c0 + 1, c1);
  S_4(c0 + 2, c1);
  S_4(c0 + 3, c1);
}
  for (int c1 = 0; c1  LENGTH - 3; c1 += 1) {
S_4(-((HEIGHT - 1) % 4) + HEIGHT - 1, c1);
if ((HEIGHT - 1) % 4 = 1) {
  S_4(-((HEIGHT - 1) % 4) + HEIGHT, c1);
  if ((HEIGHT - 1) % 4 = 2) {
S_4(-((HEIGHT - 1) % 4) + HEIGHT + 1, c1);
if (HEIGHT % 4 == 0)
  S_4(HEIGHT - 1, c1);
  }
}
  }
}

As I don't quite like the unrolling of the second loop, and the GCC standard 
unrolling
is able to unroll the first one, decided to not perfrom the unrolling within 
graphite. 
But if desirable, it could be done.

The patch remains basically the same, two maps are build, one for the regular
unroll and jam (i.e. stride mining) and the other for computing the separating 
class 
(i.e. its image is  the image of the full tiles on strided dimension). In 
graphite-isl-ast-to-gimple.c, 
these two maps are used to build the separation class option and fix the 
scheduling. 

The main differences from the previous path is that the option separating class 
is set on a 
different dimension and a contraint was added to to the map used to build the 
separating_class. 

Now some comments to your message:
 
 
 I'm not sure if Tobi or Albert have told you, but the separation_class option
 is going to be phased out since its design is fundamentally flawed.
 If you can't wait until isl-0.15, then I guess you have no choice but
 to use this option, but you should realize that it will remain frozen
 in its current broken state (until it is removed at some point).
 

No, didn't know about the phase out of separation_class option. Anyway, for the 
time 
being is the best solution available. My understanding is that this option 
should
always generate correct code, of course as long as the scheduling is correct, 
but
think that had some cases when setting the separating_class leads to incorrect 
code. 

For isl_0.15, do you intend to provide some option with a similar functionality 
? 
 
  +  /* Extract the original and auxiliar maps from pbb-transformed.
  + Set pbb-transformed to the original map. */
  +  psmap = smap;
  +  psmap-n = 0;
  +  res = isl_map_foreach_basic_map (pbb-transformed, separate_map,
  (void *)psmap);
  +  gcc_assert (res == 0);
  +
  +  isl_map_free(pbb-transformed);
  +  pbb-transformed = isl_map_copy(psmap-map_arr[0]);
  +
 
 I have no idea what this pbb-transformed is supposed to represent,
 but you appear to be assuming that it has exactly two disjuncts and that
 they appear in some order.  Now, perhaps you have explicitly
 checked that this map has two disjuncts, but then you should
 probably bring the check closer since any operation on sets that
 you perform could change the internal representation (even of
 other sets).  However, in no way can you assume that
 isl_map_foreach_basic_map will iterate over these disjuncts
 in any specific order.


At this point pbb-transformed has two basic maps, one is the mapping for 
unroll and jam, 
and one for the full tile for the striped dimension. Introduce a check that 
differentiate 
between them as the image of one maps should be included in the other.

In fact to prevent any isl side-effects, thought to introduce a
new field pbb-transformed_full in the pbb structure to be on the safe side. 
 
 
Index: gcc/toplev.c
===
--- gcc/toplev.c	(revision 217013)
+++ gcc/toplev.c	(working copy)
@@ -1302,11 +1302,12 @@
   || flag_loop_block
   || flag_loop_interchange
   || flag_loop_strip_mine
-  || flag_loop_parallelize_all)
+  || flag_loop_parallelize_all
+  || flag_loop_unroll_jam)
 sorry (Graphite loop optimizations cannot be used (ISL is not available) 
 	   (-fgraphite, -fgraphite-identity, -floop-block, 
 	   -floop-interchange, -floop-strip-mine, -floop-parallelize-all, 
-	   and -ftree-loop-linear));
+	   -floop-unroll-and-jam, and -ftree-loop-linear));
 #endif
 
   /* One region RA really helps to decrease the code size.  */
Index: gcc/graphite-optimize-isl.c
===
--- 

Re: [GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-11 Thread Mircea Namolaru
Changed the option to -floop-unroll-and jam as you suggested.

  The patch takes advantage of the new isl based code generator introduced
  recently
  in GCC (in fact of the possible options for building the AST).
 
  The code generated for this optimization in the case of non-constant loop
  bounds
  initially looks as below. This is not very useful because the standard GCC
  unrolling don't succeed to unroll the most inner loop.
 
  ISL AST generated by ISL:
  for (int c0 = 0; c0  HEIGHT; c0 += 4)
for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
  for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1)
 
 Hmm, so this iterates at most 4 times, right?  Eventually the body is
 considered
 too large by GCC or it fails to compute an upper bound for the number
 of iterations.
 Is that (an upper bound for the number of iterations) available readily from
 ISL
 at code-generation time?  If so you can transfer this knowledge to the GCC
 loop
 information.
 

The problem was not explained well. It is not only the unrolling, it is also
the loop separation (which the latest version of the patch does). Even if the 
gcc
unrolling succeeds to unroll the inner loop you will get a code similar with
the one obtained by the previous version of this patch, which is not what is 
wanted.

Last time when checked, GCC unrolling was not able to unroll the inner loop.
In my opinion it is the min and max that prevent it (graphite for blocking,
strip-mine, unroll and jam emits such code). The bounds of the iteration 
domain are expressed in min, max terms.
  
 I'm curious to see a testcase (and a way to generate the above form) to see
 what
 is actually the problem.
 

Of course. Take the code from the unroll-and-jam patch and the attached test
case (but as said other graphite options will generate similar code). But 
somehow
it seems that the new isl based code generator could handle more easily such
transformations.

Mircea



 Thanks,
 Richard.
 
S_4(c2, c1);
 
  Now, the separating class option (set for unroll and jam) produces this
  nice loop
  structure:
  ISL AST generated by ISL:
  for (int c0 = 0; c0  HEIGHT; c0 += 4)
for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
  if (HEIGHT = c0 + 4) {
for (int c2 = c0; c2 = c0 + 3; c2 += 1)
  S_4(c2, c1);
  } else
for (int c2 = c0; c2  HEIGHT; c2 += 1)
  S_4(c2, c1);
 
  The unroll option (set for unroll and jam) produces:
  ISL AST generated by ISL:
  for (int c0 = 0; c0  HEIGHT; c0 += 4)
for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
  if (HEIGHT = c0 + 4) {
S_4(c0, c1);
S_4(c0 + 1, c1);
S_4(c0 + 2, c1);
S_4(c0 + 3, c1);
  } else {
S_4(c0, c1);
if (HEIGHT = c0 + 2) {
  S_4(c0 + 1, c1);
  if (4 * floord(HEIGHT - 3, 4) + 3 == HEIGHT  c0 + 3 == HEIGHT)
S_4(HEIGHT - 1, c1);
}
  }
 
  The separate option (set by default for all dimensions for the new isl
  based code generator)
  don't succeed to remove the ifs from the loops and generate two loop
  structures (this would
  have been highly desirable).
 
  As the stage 1 is going to close soon, quick feedback to this patch is
  greatly appreciated.
  Many thanks, Mircea Namolaru
 
int
f1(int v[1024][1024], int HEIGHT, int LENGTH)
{
  int i, j;

  for (i=0; iHEIGHT; i++) {
for (j=3; j LENGTH; j++) {
  v[i][j] = v[i][j-3] + v[i][j-2] + v[i][j];
}
  }

}


Re: [GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-11 Thread Sven Verdoolaege
On Tue, Nov 11, 2014 at 04:05:57PM +0100, Mircea Namolaru wrote:
  I'm not sure if Tobi or Albert have told you, but the separation_class 
  option
  is going to be phased out since its design is fundamentally flawed.
  If you can't wait until isl-0.15, then I guess you have no choice but
  to use this option, but you should realize that it will remain frozen
  in its current broken state (until it is removed at some point).
  
 
 No, didn't know about the phase out of separation_class option. Anyway, for 
 the time 
 being is the best solution available. My understanding is that this option 
 should
 always generate correct code, of course as long as the scheduling is correct, 
 but
 think that had some cases when setting the separating_class leads to 
 incorrect code. 
 
 For isl_0.15, do you intend to provide some option with a similar 
 functionality ? 

Yes.

   +  /* Extract the original and auxiliar maps from pbb-transformed.
   + Set pbb-transformed to the original map. */
   +  psmap = smap;
   +  psmap-n = 0;
   +  res = isl_map_foreach_basic_map (pbb-transformed, separate_map,
   (void *)psmap);
   +  gcc_assert (res == 0);
   +
   +  isl_map_free(pbb-transformed);
   +  pbb-transformed = isl_map_copy(psmap-map_arr[0]);
   +
  
  I have no idea what this pbb-transformed is supposed to represent,
  but you appear to be assuming that it has exactly two disjuncts and that
  they appear in some order.  Now, perhaps you have explicitly
  checked that this map has two disjuncts, but then you should
  probably bring the check closer since any operation on sets that
  you perform could change the internal representation (even of
  other sets).  However, in no way can you assume that
  isl_map_foreach_basic_map will iterate over these disjuncts
  in any specific order.
 
 
 At this point pbb-transformed has two basic maps, one is the mapping for 
 unroll and jam, 
 and one for the full tile for the striped dimension. Introduce a check that 
 differentiate 
 between them as the image of one maps should be included in the other.

I didn't do a full review (and I won't have time for it either),
but at a quick glance,
you still seem to be assuming that if you take the union of two
basic maps, that you can extract the original two basic maps from the union.
In general, you can't.  At least you shouldn't assume that you can.
Besides, your updated code is also pretty convoluted,
with very little documentation.

skimo


Re: [GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-08 Thread Sven Verdoolaege
On Sat, Nov 08, 2014 at 12:32:05AM +0100, Mircea Namolaru wrote:
 Hello, 
 
 This is the patch for unroll and jam optimizations. It is based on the
 code written by Tobias from graphite-optimize-isl.c (the code was
 unreachable till now) extended and enabled it via a new option 
 -floop-unroll-jam.
 The patch takes advantage of the new isl based code generator introduced 
 recently 
 in GCC (in fact of the possible options for building the AST).
 
 The code generated for this optimization in the case of non-constant loop 
 bounds
 initially looks as below. This is not very useful because the standard GCC 
 unrolling don't succeed to unroll the most inner loop.
 
 ISL AST generated by ISL: 
 for (int c0 = 0; c0  HEIGHT; c0 += 4)
   for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
 for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1)
   S_4(c2, c1);
 
 Now, the separating class option (set for unroll and jam) produces this 
 nice loop 
 structure: 

I'm not sure if Tobi or Albert have told you, but the separation_class option
is going to be phased out since its design is fundamentally flawed.
If you can't wait until isl-0.15, then I guess you have no choice but
to use this option, but you should realize that it will remain frozen
in its current broken state (until it is removed at some point).

 +
 +/* Set the unroll AST built option.  */
 +
 +static __isl_give isl_union_map *
 +generate_isl_options_2 (scop_p scop, __isl_take isl_union_set *domain, 
 + int dim, int cl)

Not a very descriptive function name.

 +{
 +  isl_map  *map;
 +  isl_space *space, *space_sep;
 +  isl_ctx *ctx;
 +  isl_union_map *mapu;
 +  int nsched = get_max_schedule_dimensions (scop);
 + 
 +  ctx = scop-ctx;
 +  space_sep = isl_space_alloc(ctx, 0, 1, 1);
 +  space_sep = isl_space_wrap(space_sep);
 +  space_sep = isl_space_set_tuple_name(space_sep, isl_dim_set,
 +separation_class);
 +  space = isl_set_get_space (scop-context);
 +  space_sep = isl_space_align_params(space_sep, isl_space_copy(space));
 +  space = isl_space_map_from_domain_and_range(space, space_sep);
 +  space = isl_space_add_dims (space,isl_dim_in, nsched);

Inconsistent spacing, sometimes with a space before ( and sometimes
without.
I also noticed some tab/spacing inconsistencies later in the patch,
but I already removed that part in my reply.

 +  /* Extract the original and auxiliar maps from pbb-transformed.
 + Set pbb-transformed to the original map. */
 +  psmap = smap;
 +  psmap-n = 0;
 +  res = isl_map_foreach_basic_map (pbb-transformed, separate_map, (void 
 *)psmap);
 +  gcc_assert (res == 0);
 +
 +  isl_map_free(pbb-transformed);
 +  pbb-transformed = isl_map_copy(psmap-map_arr[0]);
 +

I have no idea what this pbb-transformed is supposed to represent,
but you appear to be assuming that it has exactly two disjuncts and that
they appear in some order.  Now, perhaps you have explicitly
checked that this map has two disjuncts, but then you should
probably bring the check closer since any operation on sets that
you perform could change the internal representation (even of
other sets).  However, in no way can you assume that
isl_map_foreach_basic_map will iterate over these disjuncts
in any specific order.

 +  bb_domain_s = isl_set_apply (isl_set_copy (bb_domain),
 +psmap-map_arr[1]);
 +  bb_domain_r = isl_set_apply (bb_domain, psmap-map_arr[0]);  
 +
 +  bb_domain_s = isl_set_complement (bb_domain_s);
 +  bb_domain_r = isl_set_subtract(bb_domain_r,bb_domain_s);

Why are you subtracting the complement of a set instead of just
intersecting with that set?

 +  /* Create an identity map for everything except DimToVectorize and the 
 + point loop. */
 +  for (i = 0; i  ScheduleDimensions; i++)
 +{
 +  if (i == DimToVectorize)
 +continue;
 +
 +  c = isl_equality_alloc (isl_local_space_copy (LocalSpace));
 +
 +  isl_constraint_set_coefficient_si (c, isl_dim_in, i, -1);
 +  isl_constraint_set_coefficient_si (c, isl_dim_out, i, 1);
 +
 +  TilingMap = isl_map_add_constraint (TilingMap, c);

You can use isl_map_equate instead.

 @@ -350,17 +425,31 @@
 SuffixSchedule);
 isl_band_list_free (Children);

Gaack!  gcc is using band forests too...
I guess I'll have to keep them around for a while then

skimo


Re: [GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-08 Thread Richard Biener
On Sat, Nov 8, 2014 at 12:32 AM, Mircea Namolaru
mircea.namol...@inria.fr wrote:
 Hello,

 This is the patch for unroll and jam optimizations. It is based on the
 code written by Tobias from graphite-optimize-isl.c (the code was
 unreachable till now) extended and enabled it via a new option 
 -floop-unroll-jam.

Why not -floop-unroll-and-jam?

 The patch takes advantage of the new isl based code generator introduced 
 recently
 in GCC (in fact of the possible options for building the AST).

 The code generated for this optimization in the case of non-constant loop 
 bounds
 initially looks as below. This is not very useful because the standard GCC
 unrolling don't succeed to unroll the most inner loop.

 ISL AST generated by ISL:
 for (int c0 = 0; c0  HEIGHT; c0 += 4)
   for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
 for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1)

Hmm, so this iterates at most 4 times, right?  Eventually the body is considered
too large by GCC or it fails to compute an upper bound for the number
of iterations.
Is that (an upper bound for the number of iterations) available readily from ISL
at code-generation time?  If so you can transfer this knowledge to the GCC loop
information.

I'm curious to see a testcase (and a way to generate the above form) to see what
is actually the problem.

Thanks,
Richard.

   S_4(c2, c1);

 Now, the separating class option (set for unroll and jam) produces this 
 nice loop
 structure:
 ISL AST generated by ISL:
 for (int c0 = 0; c0  HEIGHT; c0 += 4)
   for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
 if (HEIGHT = c0 + 4) {
   for (int c2 = c0; c2 = c0 + 3; c2 += 1)
 S_4(c2, c1);
 } else
   for (int c2 = c0; c2  HEIGHT; c2 += 1)
 S_4(c2, c1);

 The unroll option (set for unroll and jam) produces:
 ISL AST generated by ISL:
 for (int c0 = 0; c0  HEIGHT; c0 += 4)
   for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
 if (HEIGHT = c0 + 4) {
   S_4(c0, c1);
   S_4(c0 + 1, c1);
   S_4(c0 + 2, c1);
   S_4(c0 + 3, c1);
 } else {
   S_4(c0, c1);
   if (HEIGHT = c0 + 2) {
 S_4(c0 + 1, c1);
 if (4 * floord(HEIGHT - 3, 4) + 3 == HEIGHT  c0 + 3 == HEIGHT)
   S_4(HEIGHT - 1, c1);
   }
 }

 The separate option (set by default for all dimensions for the new isl 
 based code generator)
 don't succeed to remove the ifs from the loops and generate two loop 
 structures (this would
 have been highly desirable).

 As the stage 1 is going to close soon, quick feedback to this patch is 
 greatly appreciated.
 Many thanks, Mircea Namolaru


[GRAPHITE, PATCH] Loop unroll and jam optimization

2014-11-07 Thread Mircea Namolaru
Hello, 

This is the patch for unroll and jam optimizations. It is based on the
code written by Tobias from graphite-optimize-isl.c (the code was
unreachable till now) extended and enabled it via a new option 
-floop-unroll-jam.
The patch takes advantage of the new isl based code generator introduced 
recently 
in GCC (in fact of the possible options for building the AST).

The code generated for this optimization in the case of non-constant loop bounds
initially looks as below. This is not very useful because the standard GCC 
unrolling don't succeed to unroll the most inner loop.

ISL AST generated by ISL: 
for (int c0 = 0; c0  HEIGHT; c0 += 4)
  for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1)
  S_4(c2, c1);

Now, the separating class option (set for unroll and jam) produces this nice 
loop 
structure: 
ISL AST generated by ISL: 
for (int c0 = 0; c0  HEIGHT; c0 += 4)
  for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
if (HEIGHT = c0 + 4) {
  for (int c2 = c0; c2 = c0 + 3; c2 += 1)
S_4(c2, c1);
} else
  for (int c2 = c0; c2  HEIGHT; c2 += 1)
S_4(c2, c1);

The unroll option (set for unroll and jam) produces: 
ISL AST generated by ISL: 
for (int c0 = 0; c0  HEIGHT; c0 += 4)
  for (int c1 = 0; c1  LENGTH - 3; c1 += 1)
if (HEIGHT = c0 + 4) {
  S_4(c0, c1);
  S_4(c0 + 1, c1);
  S_4(c0 + 2, c1);
  S_4(c0 + 3, c1);
} else {
  S_4(c0, c1);
  if (HEIGHT = c0 + 2) {
S_4(c0 + 1, c1);
if (4 * floord(HEIGHT - 3, 4) + 3 == HEIGHT  c0 + 3 == HEIGHT)
  S_4(HEIGHT - 1, c1);
  }
}

The separate option (set by default for all dimensions for the new isl based 
code generator) 
don't succeed to remove the ifs from the loops and generate two loop structures 
(this would
have been highly desirable).   

As the stage 1 is going to close soon, quick feedback to this patch is greatly 
appreciated.
Many thanks, Mircea Namolaru
2014-11-7  Mircea Namolaru  mircea.namol...@inria.fr

* common.opt (flag_loop_unroll_jam): New flag for unroll and jam.
* params.def (PARAM_LOOP_UNROLL_JAM_SIZE)
(PARAM_LOOP_UNROLL_JAM_DEPTH): Parameters for unroll and jam flag.
* graphite-isl-ast-to-gimple.c (struct_sep_map): New structure.
* graphite-isl-ast-to-gimple.c (separate_map,generate_isl_options_2,
generate_isl_options_1,generate_isl_domain_n,generate_isl_options):
New functions to set the AST options for unroll and jam.
* graphite-isl-ast-to-gimple.c (set_options,scop_to_isl_ast):
Added support for unroll and jam options.
* graphite-optimize-isl.c (getPrevectorMap_f): New function for
unroll and jam.
* graphite-optimize-isl.c (getScheduleForBand,getScheduleForBandList):
Added support for unroll and jam.
* graphite.c: Support for unroll and jam flag.
* graphite-poly.c: Same.
* toplev.c: Same.

Index: gcc/graphite.c
===
--- gcc/graphite.c	(revision 217013)
+++ gcc/graphite.c	(working copy)
@@ -383,7 +383,8 @@
   || flag_loop_strip_mine
   || flag_graphite_identity
   || flag_loop_parallelize_all
-  || flag_loop_optimize_isl)
+  || flag_loop_optimize_isl
+  || flag_loop_unroll_jam)
 flag_graphite = 1;
 
   return flag_graphite != 0;
Index: gcc/common.opt
===
--- gcc/common.opt	(revision 217013)
+++ gcc/common.opt	(working copy)
@@ -1328,6 +1328,10 @@
 Common Report Var(flag_loop_block) Optimization
 Enable Loop Blocking transformation
 
+floop-unroll-jam
+Common Report Var(flag_loop_unroll_jam) Optimization
+Enable Loop Unroll Jam transformation
+ 
 fgnu-tm
 Common Report Var(flag_tm)
 Enable support for GNU transactional memory
Index: gcc/toplev.c
===
--- gcc/toplev.c	(revision 217013)
+++ gcc/toplev.c	(working copy)
@@ -1302,11 +1302,12 @@
   || flag_loop_block
   || flag_loop_interchange
   || flag_loop_strip_mine
-  || flag_loop_parallelize_all)
+  || flag_loop_parallelize_all
+  || flag_loop_unroll_jam)
 sorry (Graphite loop optimizations cannot be used (ISL is not available) 
 	   (-fgraphite, -fgraphite-identity, -floop-block, 
 	   -floop-interchange, -floop-strip-mine, -floop-parallelize-all, 
-	   and -ftree-loop-linear));
+	   -floop-unroll-jam, and -ftree-loop-linear));
 #endif
 
   /* One region RA really helps to decrease the code size.  */
Index: gcc/params.def
===
--- gcc/params.def	(revision 217013)
+++ gcc/params.def	(working copy)
@@ -847,6 +847,21 @@
 	  size of tiles for loop blocking,
 	  51, 0, 0)
 
+/* Size of unrolling factor for unroll-and-jam.  */
+ 
+DEFPARAM (PARAM_LOOP_UNROLL_JAM_SIZE,
+ 	  loop-unroll-jam-size,
+ 	  size of unrolling