Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
A small modification of the patch + extended comment for the mapping used for computing the separating class. The code didn't contained an adjustment of the stride with 1 (but the comment yes), that was temporary removed and forget to insert it back. The code generated now is: ISL AST generated by ISL: { for (int c0 = 0; c0 HEIGHT - 3; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = c0; c2 = c0 + 3; c2 += 1) S_4(c2, c1); if ((HEIGHT - 1) % 4 = 2) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = -((HEIGHT - 1) % 4) + HEIGHT - 1; c2 HEIGHT; c2 += 1) S_4(c2, c1); } The previous code was correct too (after all the code set an option for AST but the transformation is preserved), but slightly less optimal in the case when N was a multiple of the stride. Mircea - Original Message - From: Sven Verdoolaege sk...@kotnet.org To: Mircea Namolaru mircea.namol...@inria.fr Cc: gcc-patches@gcc.gnu.org, Tobias Grosser tob...@grosser.es, Richard Biener richard.guent...@gmail.com, Albert Cohen albert.co...@inria.fr Sent: Wednesday, November 12, 2014 12:24:40 AM Subject: Re: [GRAPHITE, PATCH] Loop unroll and jam optimization On Tue, Nov 11, 2014 at 04:05:57PM +0100, Mircea Namolaru wrote: I'm not sure if Tobi or Albert have told you, but the separation_class option is going to be phased out since its design is fundamentally flawed. If you can't wait until isl-0.15, then I guess you have no choice but to use this option, but you should realize that it will remain frozen in its current broken state (until it is removed at some point). No, didn't know about the phase out of separation_class option. Anyway, for the time being is the best solution available. My understanding is that this option should always generate correct code, of course as long as the scheduling is correct, but think that had some cases when setting the separating_class leads to incorrect code. For isl_0.15, do you intend to provide some option with a similar functionality ? Yes. + /* Extract the original and auxiliar maps from pbb-transformed. + Set pbb-transformed to the original map. */ + psmap = smap; + psmap-n = 0; + res = isl_map_foreach_basic_map (pbb-transformed, separate_map, (void *)psmap); + gcc_assert (res == 0); + + isl_map_free(pbb-transformed); + pbb-transformed = isl_map_copy(psmap-map_arr[0]); + I have no idea what this pbb-transformed is supposed to represent, but you appear to be assuming that it has exactly two disjuncts and that they appear in some order. Now, perhaps you have explicitly checked that this map has two disjuncts, but then you should probably bring the check closer since any operation on sets that you perform could change the internal representation (even of other sets). However, in no way can you assume that isl_map_foreach_basic_map will iterate over these disjuncts in any specific order. At this point pbb-transformed has two basic maps, one is the mapping for unroll and jam, and one for the full tile for the striped dimension. Introduce a check that differentiate between them as the image of one maps should be included in the other. I didn't do a full review (and I won't have time for it either), but at a quick glance, you still seem to be assuming that if you take the union of two basic maps, that you can extract the original two basic maps from the union. In general, you can't. At least you shouldn't assume that you can. Besides, your updated code is also pretty convoluted, with very little documentation. skimo 2014-11-12 Mircea Namolaru mircea.namol...@inria.fr * common.opt (flag_loop_unroll_and_jam): New flag for unroll and jam. * params.def (PARAM_LOOP_UNROLL_JAM_SIZE) (PARAM_LOOP_UNROLL_JAM_DEPTH): Parameters for unroll and jam flag. * graphite-poly.h (struct poly_bb:map_sepclass): New field * graphite-poly.c (new_poly_bb): New field intialization. * graphite-isl-ast-to-gimple.c (generate_luj_sepclass_opt, generate_luj_sepclass,generate_luj_options): New functions to set AST options for unroll and jam. * graphite-isl-ast-to-gimple.c (set_options,scop_to_isl_ast): Added support for unroll and jam options. * graphite-optimize-isl.c (getPrevectorMap_full): New function for sepating class map. * graphite-optimize-isl.c (optimize_isl,apply_schedule_map_to_scop, getScheduleMap,getScheduleForBand,getScheduleForBandList): Added support for the separating class map. * graphite.c: Support for unroll and jam flag. * graphite-poly.c: Likewise. * toplev.c: Likewise. Index: gcc/graphite-poly.h === --- gcc/graphite-poly.h
Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
At this point pbb-transformed has two basic maps, one is the mapping for unroll and jam, and one for the full tile for the striped dimension. Introduce a check that differentiate between them as the image of one maps should be included in the other. I didn't do a full review (and I won't have time for it either), but at a quick glance, you still seem to be assuming that if you take the union of two basic maps, that you can extract the original two basic maps from the union. In general, you can't. At least you shouldn't assume that you can. Besides, your updated code is also pretty convoluted, with very little documentation. Many thanks. To considerably ease the computation of separation class, needed to transmit information available at scheduling computation time to AST computation time. I didn't want to introduce new fields in graphite global structures, and tried to trick pbb-transformed to contain the unroll and jam mapping as well as a mapping use to compute the separation class. It wasn't good idea (for the mappings used by unroll and jam my version of isl preserve the basic mappings afteer union), but of course my code should not be based on assumptions not guaranteed by isl semantics. So, in this updated patch I go the other way and keep the additional information required in the pbb structure. This has the advantages that removes the most convoluted part of my code dealing with the two basic maps kept in pbb-transformed. As before in graphite-optimize_isl,c, beside the map for unroll and jam, another map needed for the separation class defined by full tiles is computed. In graphite-isl-ast-to-gimple.c, the separation class option is set (using the auxiliary map computed previously) and is added to the other AST build options. Mircea 2014-11-12 Mircea Namolaru mircea.namol...@inria.fr * common.opt (flag_loop_unroll_and_jam): New flag for unroll and jam. * params.def (PARAM_LOOP_UNROLL_JAM_SIZE) (PARAM_LOOP_UNROLL_JAM_DEPTH): Parameters for unroll and jam flag. * graphite-poly.h (struct poly_bb:map_sepclass): New field * graphite-poly.c (new_poly_bb): New field intialization. * graphite-isl-ast-to-gimple.c (generate_luj_sepclass_opt, generate_luj_sepclass,generate_luj_options): New functions to set AST options for unroll and jam. * graphite-isl-ast-to-gimple.c (set_options,scop_to_isl_ast): Added support for unroll and jam options. * graphite-optimize-isl.c (getPrevectorMap_full): New function for sepating class map. * graphite-optimize-isl.c (optimize_isl,apply_schedule_map_to_scop, getScheduleMap,getScheduleForBand,getScheduleForBandList): Added support for the separating class map. * graphite.c: Support for unroll and jam flag. * graphite-poly.c: Likewise. * toplev.c: Likewise. Index: gcc/toplev.c === --- gcc/toplev.c (revision 217013) +++ gcc/toplev.c (working copy) @@ -1302,11 +1302,12 @@ || flag_loop_block || flag_loop_interchange || flag_loop_strip_mine - || flag_loop_parallelize_all) + || flag_loop_parallelize_all + || flag_loop_unroll_jam) sorry (Graphite loop optimizations cannot be used (ISL is not available) (-fgraphite, -fgraphite-identity, -floop-block, -floop-interchange, -floop-strip-mine, -floop-parallelize-all, - and -ftree-loop-linear)); + -floop-unroll-and-jam, and -ftree-loop-linear)); #endif /* One region RA really helps to decrease the code size. */ Index: gcc/graphite-optimize-isl.c === --- gcc/graphite-optimize-isl.c (revision 217013) +++ gcc/graphite-optimize-isl.c (working copy) @@ -186,7 +186,7 @@ PartialSchedule = isl_band_get_partial_schedule (Band); *Dimensions = isl_band_n_member (Band); - if (DisableTiling) + if (DisableTiling || flag_loop_unroll_jam) return PartialSchedule; /* It does not make any sense to tile a band with just one dimension. */ @@ -241,7 +241,9 @@ constant number of iterations, if the number of loop iterations at DimToVectorize can be devided by VectorWidth. The default VectorWidth is currently constant and not yet target specific. This function does not reason - about parallelism. */ + about parallelism. + + */ static isl_map * getPrevectorMap (isl_ctx *ctx, int DimToVectorize, int ScheduleDimensions, @@ -305,8 +307,89 @@ isl_constraint_set_constant_si (c, VectorWidth - 1); TilingMap = isl_map_add_constraint (TilingMap, c); - isl_map_dump (TilingMap); + return TilingMap; +} +/* Compute an auxiliary map to getPrevectorMap, for computing the separating + class defined by full tiles. Used in graphite_isl_ast_to_gimple.c to set the + corresponding option for AST build. + */ +static isl_map *
Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
Many thanks. Here is the new patch that fixes the main problem of the previous one (i.e separation of the loop after unroll and jam) as well as the problems raised by you (see comments below). Now the code with the separation class option looks: ISL AST generated by ISL: { for (int c0 = 0; c0 HEIGHT - 4; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = c0; c2 = c0 + 3; c2 += 1) S_4(c2, c1); for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = -((HEIGHT - 1) % 4) + HEIGHT - 1; c2 HEIGHT; c2 += 1) S_4(c2, c1); } I tried the unroll option for AST, two loops are unrolled and the code looks like: ISL AST generated by ISL: { for (int c0 = 0; c0 HEIGHT - 4; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) { S_4(c0, c1); S_4(c0 + 1, c1); S_4(c0 + 2, c1); S_4(c0 + 3, c1); } for (int c1 = 0; c1 LENGTH - 3; c1 += 1) { S_4(-((HEIGHT - 1) % 4) + HEIGHT - 1, c1); if ((HEIGHT - 1) % 4 = 1) { S_4(-((HEIGHT - 1) % 4) + HEIGHT, c1); if ((HEIGHT - 1) % 4 = 2) { S_4(-((HEIGHT - 1) % 4) + HEIGHT + 1, c1); if (HEIGHT % 4 == 0) S_4(HEIGHT - 1, c1); } } } } As I don't quite like the unrolling of the second loop, and the GCC standard unrolling is able to unroll the first one, decided to not perfrom the unrolling within graphite. But if desirable, it could be done. The patch remains basically the same, two maps are build, one for the regular unroll and jam (i.e. stride mining) and the other for computing the separating class (i.e. its image is the image of the full tiles on strided dimension). In graphite-isl-ast-to-gimple.c, these two maps are used to build the separation class option and fix the scheduling. The main differences from the previous path is that the option separating class is set on a different dimension and a contraint was added to to the map used to build the separating_class. Now some comments to your message: I'm not sure if Tobi or Albert have told you, but the separation_class option is going to be phased out since its design is fundamentally flawed. If you can't wait until isl-0.15, then I guess you have no choice but to use this option, but you should realize that it will remain frozen in its current broken state (until it is removed at some point). No, didn't know about the phase out of separation_class option. Anyway, for the time being is the best solution available. My understanding is that this option should always generate correct code, of course as long as the scheduling is correct, but think that had some cases when setting the separating_class leads to incorrect code. For isl_0.15, do you intend to provide some option with a similar functionality ? + /* Extract the original and auxiliar maps from pbb-transformed. + Set pbb-transformed to the original map. */ + psmap = smap; + psmap-n = 0; + res = isl_map_foreach_basic_map (pbb-transformed, separate_map, (void *)psmap); + gcc_assert (res == 0); + + isl_map_free(pbb-transformed); + pbb-transformed = isl_map_copy(psmap-map_arr[0]); + I have no idea what this pbb-transformed is supposed to represent, but you appear to be assuming that it has exactly two disjuncts and that they appear in some order. Now, perhaps you have explicitly checked that this map has two disjuncts, but then you should probably bring the check closer since any operation on sets that you perform could change the internal representation (even of other sets). However, in no way can you assume that isl_map_foreach_basic_map will iterate over these disjuncts in any specific order. At this point pbb-transformed has two basic maps, one is the mapping for unroll and jam, and one for the full tile for the striped dimension. Introduce a check that differentiate between them as the image of one maps should be included in the other. In fact to prevent any isl side-effects, thought to introduce a new field pbb-transformed_full in the pbb structure to be on the safe side. Index: gcc/toplev.c === --- gcc/toplev.c (revision 217013) +++ gcc/toplev.c (working copy) @@ -1302,11 +1302,12 @@ || flag_loop_block || flag_loop_interchange || flag_loop_strip_mine - || flag_loop_parallelize_all) + || flag_loop_parallelize_all + || flag_loop_unroll_jam) sorry (Graphite loop optimizations cannot be used (ISL is not available) (-fgraphite, -fgraphite-identity, -floop-block, -floop-interchange, -floop-strip-mine, -floop-parallelize-all, - and -ftree-loop-linear)); + -floop-unroll-and-jam, and -ftree-loop-linear)); #endif /* One region RA really helps to decrease the code size. */ Index: gcc/graphite-optimize-isl.c === ---
Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
Changed the option to -floop-unroll-and jam as you suggested. The patch takes advantage of the new isl based code generator introduced recently in GCC (in fact of the possible options for building the AST). The code generated for this optimization in the case of non-constant loop bounds initially looks as below. This is not very useful because the standard GCC unrolling don't succeed to unroll the most inner loop. ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1) Hmm, so this iterates at most 4 times, right? Eventually the body is considered too large by GCC or it fails to compute an upper bound for the number of iterations. Is that (an upper bound for the number of iterations) available readily from ISL at code-generation time? If so you can transfer this knowledge to the GCC loop information. The problem was not explained well. It is not only the unrolling, it is also the loop separation (which the latest version of the patch does). Even if the gcc unrolling succeeds to unroll the inner loop you will get a code similar with the one obtained by the previous version of this patch, which is not what is wanted. Last time when checked, GCC unrolling was not able to unroll the inner loop. In my opinion it is the min and max that prevent it (graphite for blocking, strip-mine, unroll and jam emits such code). The bounds of the iteration domain are expressed in min, max terms. I'm curious to see a testcase (and a way to generate the above form) to see what is actually the problem. Of course. Take the code from the unroll-and-jam patch and the attached test case (but as said other graphite options will generate similar code). But somehow it seems that the new isl based code generator could handle more easily such transformations. Mircea Thanks, Richard. S_4(c2, c1); Now, the separating class option (set for unroll and jam) produces this nice loop structure: ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) if (HEIGHT = c0 + 4) { for (int c2 = c0; c2 = c0 + 3; c2 += 1) S_4(c2, c1); } else for (int c2 = c0; c2 HEIGHT; c2 += 1) S_4(c2, c1); The unroll option (set for unroll and jam) produces: ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) if (HEIGHT = c0 + 4) { S_4(c0, c1); S_4(c0 + 1, c1); S_4(c0 + 2, c1); S_4(c0 + 3, c1); } else { S_4(c0, c1); if (HEIGHT = c0 + 2) { S_4(c0 + 1, c1); if (4 * floord(HEIGHT - 3, 4) + 3 == HEIGHT c0 + 3 == HEIGHT) S_4(HEIGHT - 1, c1); } } The separate option (set by default for all dimensions for the new isl based code generator) don't succeed to remove the ifs from the loops and generate two loop structures (this would have been highly desirable). As the stage 1 is going to close soon, quick feedback to this patch is greatly appreciated. Many thanks, Mircea Namolaru int f1(int v[1024][1024], int HEIGHT, int LENGTH) { int i, j; for (i=0; iHEIGHT; i++) { for (j=3; j LENGTH; j++) { v[i][j] = v[i][j-3] + v[i][j-2] + v[i][j]; } } }
Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
On Tue, Nov 11, 2014 at 04:05:57PM +0100, Mircea Namolaru wrote: I'm not sure if Tobi or Albert have told you, but the separation_class option is going to be phased out since its design is fundamentally flawed. If you can't wait until isl-0.15, then I guess you have no choice but to use this option, but you should realize that it will remain frozen in its current broken state (until it is removed at some point). No, didn't know about the phase out of separation_class option. Anyway, for the time being is the best solution available. My understanding is that this option should always generate correct code, of course as long as the scheduling is correct, but think that had some cases when setting the separating_class leads to incorrect code. For isl_0.15, do you intend to provide some option with a similar functionality ? Yes. + /* Extract the original and auxiliar maps from pbb-transformed. + Set pbb-transformed to the original map. */ + psmap = smap; + psmap-n = 0; + res = isl_map_foreach_basic_map (pbb-transformed, separate_map, (void *)psmap); + gcc_assert (res == 0); + + isl_map_free(pbb-transformed); + pbb-transformed = isl_map_copy(psmap-map_arr[0]); + I have no idea what this pbb-transformed is supposed to represent, but you appear to be assuming that it has exactly two disjuncts and that they appear in some order. Now, perhaps you have explicitly checked that this map has two disjuncts, but then you should probably bring the check closer since any operation on sets that you perform could change the internal representation (even of other sets). However, in no way can you assume that isl_map_foreach_basic_map will iterate over these disjuncts in any specific order. At this point pbb-transformed has two basic maps, one is the mapping for unroll and jam, and one for the full tile for the striped dimension. Introduce a check that differentiate between them as the image of one maps should be included in the other. I didn't do a full review (and I won't have time for it either), but at a quick glance, you still seem to be assuming that if you take the union of two basic maps, that you can extract the original two basic maps from the union. In general, you can't. At least you shouldn't assume that you can. Besides, your updated code is also pretty convoluted, with very little documentation. skimo
Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
On Sat, Nov 08, 2014 at 12:32:05AM +0100, Mircea Namolaru wrote: Hello, This is the patch for unroll and jam optimizations. It is based on the code written by Tobias from graphite-optimize-isl.c (the code was unreachable till now) extended and enabled it via a new option -floop-unroll-jam. The patch takes advantage of the new isl based code generator introduced recently in GCC (in fact of the possible options for building the AST). The code generated for this optimization in the case of non-constant loop bounds initially looks as below. This is not very useful because the standard GCC unrolling don't succeed to unroll the most inner loop. ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1) S_4(c2, c1); Now, the separating class option (set for unroll and jam) produces this nice loop structure: I'm not sure if Tobi or Albert have told you, but the separation_class option is going to be phased out since its design is fundamentally flawed. If you can't wait until isl-0.15, then I guess you have no choice but to use this option, but you should realize that it will remain frozen in its current broken state (until it is removed at some point). + +/* Set the unroll AST built option. */ + +static __isl_give isl_union_map * +generate_isl_options_2 (scop_p scop, __isl_take isl_union_set *domain, + int dim, int cl) Not a very descriptive function name. +{ + isl_map *map; + isl_space *space, *space_sep; + isl_ctx *ctx; + isl_union_map *mapu; + int nsched = get_max_schedule_dimensions (scop); + + ctx = scop-ctx; + space_sep = isl_space_alloc(ctx, 0, 1, 1); + space_sep = isl_space_wrap(space_sep); + space_sep = isl_space_set_tuple_name(space_sep, isl_dim_set, +separation_class); + space = isl_set_get_space (scop-context); + space_sep = isl_space_align_params(space_sep, isl_space_copy(space)); + space = isl_space_map_from_domain_and_range(space, space_sep); + space = isl_space_add_dims (space,isl_dim_in, nsched); Inconsistent spacing, sometimes with a space before ( and sometimes without. I also noticed some tab/spacing inconsistencies later in the patch, but I already removed that part in my reply. + /* Extract the original and auxiliar maps from pbb-transformed. + Set pbb-transformed to the original map. */ + psmap = smap; + psmap-n = 0; + res = isl_map_foreach_basic_map (pbb-transformed, separate_map, (void *)psmap); + gcc_assert (res == 0); + + isl_map_free(pbb-transformed); + pbb-transformed = isl_map_copy(psmap-map_arr[0]); + I have no idea what this pbb-transformed is supposed to represent, but you appear to be assuming that it has exactly two disjuncts and that they appear in some order. Now, perhaps you have explicitly checked that this map has two disjuncts, but then you should probably bring the check closer since any operation on sets that you perform could change the internal representation (even of other sets). However, in no way can you assume that isl_map_foreach_basic_map will iterate over these disjuncts in any specific order. + bb_domain_s = isl_set_apply (isl_set_copy (bb_domain), +psmap-map_arr[1]); + bb_domain_r = isl_set_apply (bb_domain, psmap-map_arr[0]); + + bb_domain_s = isl_set_complement (bb_domain_s); + bb_domain_r = isl_set_subtract(bb_domain_r,bb_domain_s); Why are you subtracting the complement of a set instead of just intersecting with that set? + /* Create an identity map for everything except DimToVectorize and the + point loop. */ + for (i = 0; i ScheduleDimensions; i++) +{ + if (i == DimToVectorize) +continue; + + c = isl_equality_alloc (isl_local_space_copy (LocalSpace)); + + isl_constraint_set_coefficient_si (c, isl_dim_in, i, -1); + isl_constraint_set_coefficient_si (c, isl_dim_out, i, 1); + + TilingMap = isl_map_add_constraint (TilingMap, c); You can use isl_map_equate instead. @@ -350,17 +425,31 @@ SuffixSchedule); isl_band_list_free (Children); Gaack! gcc is using band forests too... I guess I'll have to keep them around for a while then skimo
Re: [GRAPHITE, PATCH] Loop unroll and jam optimization
On Sat, Nov 8, 2014 at 12:32 AM, Mircea Namolaru mircea.namol...@inria.fr wrote: Hello, This is the patch for unroll and jam optimizations. It is based on the code written by Tobias from graphite-optimize-isl.c (the code was unreachable till now) extended and enabled it via a new option -floop-unroll-jam. Why not -floop-unroll-and-jam? The patch takes advantage of the new isl based code generator introduced recently in GCC (in fact of the possible options for building the AST). The code generated for this optimization in the case of non-constant loop bounds initially looks as below. This is not very useful because the standard GCC unrolling don't succeed to unroll the most inner loop. ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1) Hmm, so this iterates at most 4 times, right? Eventually the body is considered too large by GCC or it fails to compute an upper bound for the number of iterations. Is that (an upper bound for the number of iterations) available readily from ISL at code-generation time? If so you can transfer this knowledge to the GCC loop information. I'm curious to see a testcase (and a way to generate the above form) to see what is actually the problem. Thanks, Richard. S_4(c2, c1); Now, the separating class option (set for unroll and jam) produces this nice loop structure: ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) if (HEIGHT = c0 + 4) { for (int c2 = c0; c2 = c0 + 3; c2 += 1) S_4(c2, c1); } else for (int c2 = c0; c2 HEIGHT; c2 += 1) S_4(c2, c1); The unroll option (set for unroll and jam) produces: ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) if (HEIGHT = c0 + 4) { S_4(c0, c1); S_4(c0 + 1, c1); S_4(c0 + 2, c1); S_4(c0 + 3, c1); } else { S_4(c0, c1); if (HEIGHT = c0 + 2) { S_4(c0 + 1, c1); if (4 * floord(HEIGHT - 3, 4) + 3 == HEIGHT c0 + 3 == HEIGHT) S_4(HEIGHT - 1, c1); } } The separate option (set by default for all dimensions for the new isl based code generator) don't succeed to remove the ifs from the loops and generate two loop structures (this would have been highly desirable). As the stage 1 is going to close soon, quick feedback to this patch is greatly appreciated. Many thanks, Mircea Namolaru
[GRAPHITE, PATCH] Loop unroll and jam optimization
Hello, This is the patch for unroll and jam optimizations. It is based on the code written by Tobias from graphite-optimize-isl.c (the code was unreachable till now) extended and enabled it via a new option -floop-unroll-jam. The patch takes advantage of the new isl based code generator introduced recently in GCC (in fact of the possible options for building the AST). The code generated for this optimization in the case of non-constant loop bounds initially looks as below. This is not very useful because the standard GCC unrolling don't succeed to unroll the most inner loop. ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) for (int c2 = c0; c2 = min(HEIGHT - 1, c0 + 3); c2 += 1) S_4(c2, c1); Now, the separating class option (set for unroll and jam) produces this nice loop structure: ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) if (HEIGHT = c0 + 4) { for (int c2 = c0; c2 = c0 + 3; c2 += 1) S_4(c2, c1); } else for (int c2 = c0; c2 HEIGHT; c2 += 1) S_4(c2, c1); The unroll option (set for unroll and jam) produces: ISL AST generated by ISL: for (int c0 = 0; c0 HEIGHT; c0 += 4) for (int c1 = 0; c1 LENGTH - 3; c1 += 1) if (HEIGHT = c0 + 4) { S_4(c0, c1); S_4(c0 + 1, c1); S_4(c0 + 2, c1); S_4(c0 + 3, c1); } else { S_4(c0, c1); if (HEIGHT = c0 + 2) { S_4(c0 + 1, c1); if (4 * floord(HEIGHT - 3, 4) + 3 == HEIGHT c0 + 3 == HEIGHT) S_4(HEIGHT - 1, c1); } } The separate option (set by default for all dimensions for the new isl based code generator) don't succeed to remove the ifs from the loops and generate two loop structures (this would have been highly desirable). As the stage 1 is going to close soon, quick feedback to this patch is greatly appreciated. Many thanks, Mircea Namolaru 2014-11-7 Mircea Namolaru mircea.namol...@inria.fr * common.opt (flag_loop_unroll_jam): New flag for unroll and jam. * params.def (PARAM_LOOP_UNROLL_JAM_SIZE) (PARAM_LOOP_UNROLL_JAM_DEPTH): Parameters for unroll and jam flag. * graphite-isl-ast-to-gimple.c (struct_sep_map): New structure. * graphite-isl-ast-to-gimple.c (separate_map,generate_isl_options_2, generate_isl_options_1,generate_isl_domain_n,generate_isl_options): New functions to set the AST options for unroll and jam. * graphite-isl-ast-to-gimple.c (set_options,scop_to_isl_ast): Added support for unroll and jam options. * graphite-optimize-isl.c (getPrevectorMap_f): New function for unroll and jam. * graphite-optimize-isl.c (getScheduleForBand,getScheduleForBandList): Added support for unroll and jam. * graphite.c: Support for unroll and jam flag. * graphite-poly.c: Same. * toplev.c: Same. Index: gcc/graphite.c === --- gcc/graphite.c (revision 217013) +++ gcc/graphite.c (working copy) @@ -383,7 +383,8 @@ || flag_loop_strip_mine || flag_graphite_identity || flag_loop_parallelize_all - || flag_loop_optimize_isl) + || flag_loop_optimize_isl + || flag_loop_unroll_jam) flag_graphite = 1; return flag_graphite != 0; Index: gcc/common.opt === --- gcc/common.opt (revision 217013) +++ gcc/common.opt (working copy) @@ -1328,6 +1328,10 @@ Common Report Var(flag_loop_block) Optimization Enable Loop Blocking transformation +floop-unroll-jam +Common Report Var(flag_loop_unroll_jam) Optimization +Enable Loop Unroll Jam transformation + fgnu-tm Common Report Var(flag_tm) Enable support for GNU transactional memory Index: gcc/toplev.c === --- gcc/toplev.c (revision 217013) +++ gcc/toplev.c (working copy) @@ -1302,11 +1302,12 @@ || flag_loop_block || flag_loop_interchange || flag_loop_strip_mine - || flag_loop_parallelize_all) + || flag_loop_parallelize_all + || flag_loop_unroll_jam) sorry (Graphite loop optimizations cannot be used (ISL is not available) (-fgraphite, -fgraphite-identity, -floop-block, -floop-interchange, -floop-strip-mine, -floop-parallelize-all, - and -ftree-loop-linear)); + -floop-unroll-jam, and -ftree-loop-linear)); #endif /* One region RA really helps to decrease the code size. */ Index: gcc/params.def === --- gcc/params.def (revision 217013) +++ gcc/params.def (working copy) @@ -847,6 +847,21 @@ size of tiles for loop blocking, 51, 0, 0) +/* Size of unrolling factor for unroll-and-jam. */ + +DEFPARAM (PARAM_LOOP_UNROLL_JAM_SIZE, + loop-unroll-jam-size, + size of unrolling