Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)
On Sun, Oct 01, 2017 at 11:58:30AM +0200, Sven Verdoolaege wrote: > For the approach pluto is taking, you'll have to look at the source > code, see pluto_intra_tile_optimize_band. > For the other two approaches I mentioned above, reports will > be made available within the next couple of weeks. https://hal.inria.fr/hal-01628798 http://www.cs.kuleuven.be/publicaties/rapporten/cw/CW709.abs.html skimo
Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)
On Sat, Sep 30, 2017 at 07:47:43PM +0200, Richard Biener wrote: > On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop wrote: > >On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege wrote: > >> [Sorry for the resend; I used the wrong email address to CC Alex] > >> > >> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: > >>> Ah, so I now see why we do not perform interchange on trivial cases > >like > >>> > >>> double A[1024][1024], B[1024][1024]; > >>> > >>> void foo(void) > >>> { > >>> for (int i = 0; i < 1024; ++i) > >>> for (int j = 0; j < 1024; ++j) > >>> A[j][i] = B[j][i]; > >>> } > >> > >> I didn't see you mentioning _why_ you expect an interchange here. > >> Are you perhaps interested in spatial locality? > >> If so, then there are several approaches for taking > >> that into account. > >> - pluto performs an intra-tile loop interchange to > >> improve temporal and/or spatial locality. It shouldn't > >> be too hard to do something similar on an isl generated > >> schedule > >> - Alex (Oleksandr) has been working on an extension of > >> the isl scheduler that takes into account spatial locality. > >> I'm not sure if it's publicly available. > >> - I've been working on a special case of spatial locality > >> (consecutivity). The current version is available in > >> the consecutivity branch. Note that it may get rebased and > >> it may not necessarily get merged into master. > >> > >> There are also other approaches, but they may not be that > >> easy to combine with the isl scheduler. > > > >Would the following work? > >Add to the proximity relation the array accesses from two > >successive iterations of the innermost loop: > >A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] > >With these two extra relations in the proximity map, > >isl should be able to interchange the above loop. > > Can anyone give a hint on how to do that in ISL terms?
For the approach pluto is taking, you'll have to look at the source code, see pluto_intra_tile_optimize_band. For the other two approaches I mentioned above, reports will be made available within the next couple of weeks. For the last one, there is a sample implementation in the consecutivity branch of PPCG. skimo
Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)
On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop wrote: >On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege wrote: >> [Sorry for the resend; I used the wrong email address to CC Alex] >> >> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: >>> Ah, so I now see why we do not perform interchange on trivial cases >like >>> >>> double A[1024][1024], B[1024][1024]; >>> >>> void foo(void) >>> { >>> for (int i = 0; i < 1024; ++i) >>> for (int j = 0; j < 1024; ++j) >>> A[j][i] = B[j][i]; >>> } >> >> I didn't see you mentioning _why_ you expect an interchange here. >> Are you perhaps interested in spatial locality? >> If so, then there are several approaches for taking >> that into account. >> - pluto performs an intra-tile loop interchange to >> improve temporal and/or spatial locality. It shouldn't >> be too hard to do something similar on an isl generated >> schedule >> - Alex (Oleksandr) has been working on an extension of >> the isl scheduler that takes into account spatial locality. >> I'm not sure if it's publicly available. >> - I've been working on a special case of spatial locality >> (consecutivity). The current version is available in >> the consecutivity branch. Note that it may get rebased and >> it may not necessarily get merged into master. >> >> There are also other approaches, but they may not be that >> easy to combine with the isl scheduler. > >Would the following work? >Add to the proximity relation the array accesses from two >successive iterations of the innermost loop: >A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] >With these two extra relations in the proximity map, >isl should be able to interchange the above loop. Can anyone give a hint on how to do that in ISL terms? Richard. >Sebastian
Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)
On 29/09/17 21:58, Sebastian Pop wrote: On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege wrote: [Sorry for the resend; I used the wrong email address to CC Alex] On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: Ah, so I now see why we do not perform interchange on trivial cases like double A[1024][1024], B[1024][1024]; void foo(void) { for (int i = 0; i < 1024; ++i) for (int j = 0; j < 1024; ++j) A[j][i] = B[j][i]; } I didn't see you mentioning _why_ you expect an interchange here. Are you perhaps interested in spatial locality? If so, then there are several approaches for taking that into account. - pluto performs an intra-tile loop interchange to improve temporal and/or spatial locality. It shouldn't be too hard to do something similar on an isl generated schedule - Alex (Oleksandr) has been working on an extension of the isl scheduler that takes into account spatial locality. I'm not sure if it's publicly available. - I've been working on a special case of spatial locality (consecutivity). The current version is available in the consecutivity branch. Note that it may get rebased and it may not necessarily get merged into master. There are also other approaches, but they may not be that easy to combine with the isl scheduler. Would the following work? Add to the proximity relation the array accesses from two successive iterations of the innermost loop: A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] With these two extra relations in the proximity map, isl should be able to interchange the above loop. Sebastian Hi, this looks very close to what we do for spatial locality in the scheduler, except that we separate proximity and "spatial proximity" maps. There are a couple of caveats in just plugging those into proximity, in particular resolving conflicts between spatial and temporal locality, and avoiding unnecessary skewing.
Cheers, Alex -- Oleksandr Zinenko, Inria / École Normale Supérieure, cont...@ozinenko.com, oleksandr.zine...@inria.fr https://www.ozinenko.com
Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)
On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege wrote: > [Sorry for the resend; I used the wrong email address to CC Alex] > > On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: >> Ah, so I now see why we do not perform interchange on trivial cases like >> >> double A[1024][1024], B[1024][1024]; >> >> void foo(void) >> { >> for (int i = 0; i < 1024; ++i) >> for (int j = 0; j < 1024; ++j) >> A[j][i] = B[j][i]; >> } > > I didn't see you mentioning _why_ you expect an interchange here. > Are you perhaps interested in spatial locality? > If so, then there are several approaches for taking > that into account. > - pluto performs an intra-tile loop interchange to > improve temporal and/or spatial locality. It shouldn't > be too hard to do something similar on an isl generated > schedule > - Alex (Oleksandr) has been working on an extension of > the isl scheduler that takes into account spatial locality. > I'm not sure if it's publicly available. > - I've been working on a special case of spatial locality > (consecutivity). The current version is available in > the consecutivity branch. Note that it may get rebased and > it may not necessarily get merged into master. > > There are also other approaches, but they may not be that > easy to combine with the isl scheduler. Would the following work? Add to the proximity relation the array accesses from two successive iterations of the innermost loop: A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] With these two extra relations in the proximity map, isl should be able to interchange the above loop. Sebastian
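To make the effect concrete, here is a minimal, self-contained C sketch (not GCC or isl code; the array size is shrunk and the function names are invented for illustration) of the interchange those proximity relations are meant to trigger: the two loops swap so that the innermost loop walks A and B with unit stride instead of a stride of a full row.

```c
#include <assert.h>
#include <string.h>

#define N 64  /* smaller than the 1024 of the example, same structure */

static double A1[N][N], A2[N][N], B[N][N];

/* Original order: j (the row subscript) is innermost, so consecutive
   iterations touch elements N doubles apart. */
static void copy_original (void)
{
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      A1[j][i] = B[j][i];
}

/* Interchanged order: i is innermost, so consecutive iterations touch
   adjacent doubles (unit stride, the cache-friendly order for C arrays). */
static void copy_interchanged (void)
{
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      A2[j][i] = B[j][i];
}

/* There are no dependences at all in this nest, so both orders must
   produce identical results; only the memory access order differs. */
static int interchange_preserves_semantics (void)
{
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      B[i][j] = i * N + j;
  copy_original ();
  copy_interchanged ();
  return memcmp (A1, A2, sizeof A1) == 0;
}
```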
isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)
[Sorry for the resend; I used the wrong email address to CC Alex] On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: > Ah, so I now see why we do not perform interchange on trivial cases like > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 0; i < 1024; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i]; > } I didn't see you mentioning _why_ you expect an interchange here. Are you perhaps interested in spatial locality? If so, then there are several approaches for taking that into account. - pluto performs an intra-tile loop interchange to improve temporal and/or spatial locality. It shouldn't be too hard to do something similar on an isl generated schedule - Alex (Oleksandr) has been working on an extension of the isl scheduler that takes into account spatial locality. I'm not sure if it's publicly available. - I've been working on a special case of spatial locality (consecutivity). The current version is available in the consecutivity branch. Note that it may get rebased and it may not necessarily get merged into master. There are also other approaches, but they may not be that easy to combine with the isl scheduler. skimo
Re: [PATCH][GRAPHITE] More TLC
On Fri, Sep 29, 2017 at 6:17 AM, Richard Biener wrote: > I fixed the "hack patch" somewhat but realized it's not really possible > ATM to get back at this form because the array descriptor contains only > information to generate the linearized form. So while I get now correct > values they end up with runtime divisions which aren't handled by > SCEV. You are right, SCEV has some limits on representing and folding those division expressions. There is a proposal in LLVM from Johannes Doerfert https://reviews.llvm.org/D38255 to use isl as a representation and expression folder instead of the chains of recurrences for the scalar evolution analysis. isl would be able to handle some of the semantics of the div_exprs, and the semantics of wrap-around variables, and of course it would have some other limits to represent multiplications (as we spoke about yesterday, i * N or M * N for example), and thus the polyhedral analysis would need to rely on the delinearization of array indices.
Re: [PATCH][GRAPHITE] More TLC
On Thu, 28 Sep 2017, Sebastian Pop wrote: > On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener wrote: > > Looks like even when hacking the Fortran FE to produce nested > > ARRAY_REFs we run into the same issue for > > > > (gdb) p debug_data_reference (dr) > > #(Data Ref: > > # bb: 17 > > # stmt: > > VIEW_CONVERT_EXPR (*y_117(D))[_24]{lb: > > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 > > sz: 8} = 0.0; > > # ref: > > VIEW_CONVERT_EXPR (*y_117(D))[_24]{lb: > > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 > > sz: 8}; > > # base_object: > > VIEW_CONVERT_EXPR (*y_117(D)); > > # Access function 0: {1, +, 1}_4 > > # Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, > > (unsigned long) stride.88_92}_3; > > # Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +, > > (unsigned long) stride.90_96}_2; > > # Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +, > > (unsigned long) stride.92_100}_1; > > > > so it looks like simple strided (where stride is a parameter) access > > is not handled either. > > Yes, this is the first option I was mentioning: it could work, > could you please make sure that you don't have a bug in the "hack patch" > where the outer dimension should not contain the parameter > (inner array dimension) times the access function. > > Example in C: > int A[100][N]; > A[i][j] is linearized as *(A + i * N * 4 + j * 4) > and you may have a bug if you delinearized it in the Fortran FE as A[i * N][j] > Could you please check that it would delinearize back to A[i][j]? I fixed the "hack patch" somewhat but realized it's not really possible ATM to get back at this form because the array descriptor contains only information to generate the linearized form. So while I get now correct values they end up with runtime divisions which aren't handled by SCEV.
I fear emitting delinearized code from the Fortran FE would be an ABI breakage as we'd have to change the array descriptor contents. > > > > GCC's dependence analysis can at least compute distances of two > > DRs when the difference of the access CHRECs is constant. Within > > the polyhedral model those cases cannot be handled? > > The difficulty for the polyhedral model is in the representation > of a multiplication of parameter times loop index variable. > The delinearization removes these difficulties by creating linear expressions. > > Think about multiplication as something introducing non-linearity > and you realize that any such expression would not fit in the > linear model of polyhedra. > A parameter is nothing other than an outer loop index whose loop > level we have no access to, as it may be outside the current function > in which we receive that parameter. Yeah, I see that now. Richard.
Re: [PATCH][GRAPHITE] More TLC
On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener wrote: > Looks like even when hacking the Fortran FE to produce nested > ARRAY_REFs we run into the same issue for > > (gdb) p debug_data_reference (dr) > #(Data Ref: > # bb: 17 > # stmt: > VIEW_CONVERT_EXPR (*y_117(D))[_24]{lb: > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 > sz: 8} = 0.0; > # ref: > VIEW_CONVERT_EXPR (*y_117(D))[_24]{lb: > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 > sz: 8}; > # base_object: > VIEW_CONVERT_EXPR (*y_117(D)); > # Access function 0: {1, +, 1}_4 > # Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, > (unsigned long) stride.88_92}_3; > # Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +, > (unsigned long) stride.90_96}_2; > # Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +, > (unsigned long) stride.92_100}_1; > > so it looks like simple strided (where stride is a parameter) access > is not handled either. Yes, this is the first option I was mentioning: it could work, could you please make sure that you don't have a bug in the "hack patch" where the outer dimension should not contain the parameter (inner array dimension) times the access function. Example in C: int A[100][N]; A[i][j] is linearized as *(A + i * N * 4 + j * 4) and you may have a bug if you delinearized it in the Fortran FE as A[i * N][j] Could you please check that it would delinearize back to A[i][j]? > > GCC's dependence analysis can at least compute distances of two > DRs when the difference of the access CHRECs is constant. Within > the polyhedral model those cases cannot be handled? The difficulty for the polyhedral model is in the representation of a multiplication of parameter times loop index variable. The delinearization removes these difficulties by creating linear expressions.
Think about multiplication as something introducing non-linearity and you realize that any such expression would not fit in the linear model of polyhedra. A parameter is nothing other than an outer loop index whose loop level we have no access to, as it may be outside the current function in which we receive that parameter. Sebastian
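The round-trip check asked for above can be sketched in plain C (not compiler code; the helper names are invented for illustration): delinearization recovers the two subscripts from the linearized element offset by division and modulo, and the round trip is exact precisely when the inner subscript satisfies 0 <= j < N.

```c
#include <assert.h>

/* Recover (i, j) of A[i][j] from the linearized element offset of an
   int A[100][N]: offset = i * N + j, with the guard 0 <= j < N. */
static void delinearize (long offset, long n, long *i, long *j)
{
  *i = offset / n;   /* outer subscript */
  *j = offset % n;   /* inner subscript, by construction 0 <= *j < n */
}

/* Linearize A[i][j], delinearize again, and report whether we got the
   original subscripts back.  This holds exactly when 0 <= j < n; a
   "delinearization" like A[i * N][j] would fail this check. */
static int roundtrip (long i, long j, long n)
{
  long off = i * n + j;
  long ri, rj;
  delinearize (off, n, &ri, &rj);
  return ri == i && rj == j;
}
```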
Re: [PATCH][GRAPHITE] More TLC
On Wed, Sep 27, 2017 at 8:04 AM, Richard Biener wrote: > > Another thing I notice is that we don't handle the multi-dimensional > accesses the fortran frontend produces: > > (gdb) p debug_data_reference (dr) > #(Data Ref: > # bb: 18 > # stmt: _43 = *a_141(D)[_42]; > # ref: *a_141(D)[_42]; > # base_object: *a_141(D); > # Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, > stride.88_115}_5 > > ultimately we fail here because we try to build a constraint for > > {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5 > > which ends up computing isl_pw_aff_mul (A, stride.88_115) with > A being the non-constant constraint generated for > {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being > a parameter. ISL doesn't like that multiplication as the result > isn't affine (well - it is, we just have parameters in there). > > I suppose ISL doesn't handle this form of accesses given the > two "dimensions" in this scalarized form may overlap? So we'd > really need to turn those into references with different access > functions (even if that's not 100% a valid semantic transformation > as scalarization isn't reversible without extra information)? You are right.
This multivariate memory access would be better handled in delinearized form: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66981 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 There are two ways to handle this issue: - fix the Fortran front-end to emit multi-dimensional ARRAY_REFs, - implement an array delinearization pass, as I implemented in LLVM http://llvm.org/doxygen/Delinearization_8cpp_source.html "On Recovering Multi-Dimensional Arrays in Polly" http://impact.gforge.inria.fr/impact2015/papers/impact2015-grosser.pdf "Optimistic Delinearization of Parametrically Sized Arrays" https://dl.acm.org/citation.cfm?id=2751248 LLVM does not have an equivalent of the multi-dimensional ARRAY_REF description; it only reasons about linearized memory accesses, like GCC's RTL (gep = Get Element Pointer), so we had no other option than to delinearize. Sebastian
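As a sketch of why the delinearized form is easier for a polyhedral tool, here is plain C modeled on the access function quoted earlier in the thread (`base` stands in for _38, the chrec semantics {start, +, step}_loop are assumed, and both helper names are invented): the scalarized form multiplies the parameter `stride` by a loop index, which is non-affine for isl, while the delinearized form has two subscripts that are each affine in the loop indices, with `stride` appearing only as a dimension size.

```c
#include <assert.h>

/* Scalarized access function {{(base + stride) + 1, +, 1}_j, +, stride}_i
   evaluated at iteration (i, j): one linearized offset containing the
   non-affine term i * stride. */
static long linearized (long base, long stride, long i, long j)
{
  return (base + stride + 1) + i * stride + j;
}

/* Delinearized form: subscripts [i + 1][j] into an array of row length
   `stride`; each subscript is affine in the loop indices, and the +1
   constant folds into the base offset. */
static long delinearized (long base, long stride, long i, long j)
{
  long sub0 = i + 1;  /* affine in i only */
  long sub1 = j;      /* affine in j only */
  return (base + 1) + sub0 * stride + sub1;
}
```

Both forms address the same element; only the delinearized one exposes subscripts a linear (affine) model can reason about.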
Re: [PATCH][GRAPHITE] More TLC
On Wed, Sep 27, 2017 at 7:18 AM, Richard Biener wrote: > On Tue, 26 Sep 2017, Sebastian Pop wrote: > > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener > wrote: > > > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener > > > wrote: > > > > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > > > 61 loop nests "optimized" > > > > > 45 loop nest transforms cancelled because of code generation issues > > > > > 21 loop nest optimizations timed out the 35 ISL "operations" we > > > allow > > > > > > > > > > I say "optimized" because the usual transform I've seen is static > > > tiling > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > > > > > > > Here is how to automate (without magic) the detection > > > > of the transform that isl did. > > > > > > > > The problem solved by isl is the minimization of strides > > > > in memory, and to do this, we need to tell the isl scheduler > > > > the validity dependence graph, in graphite-optimize-isl.c > > > > see the validity (RAW, WAR, WAW) and the proximity > > > > (RAR + validity) maps. The proximity does include the > > > > read after read, as the isl scheduler needs to minimize > > > > strides between consecutive reads. 
> > Ah, so I now see why we do not perform interchange on trivial cases like > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 0; i < 1024; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i]; > } > > which is probably because > > /* FIXME: proximity should not be validity. */ > isl_union_map *proximity = isl_union_map_copy (validity); > > falls apart when there is _no_ dependence? > You are right. The proximity needs to account for spatial locality as well if you want to interchange the loop. To describe the spatial locality, I would recommend adding to the proximity relation the array accesses from two successive iterations of the innermost loop: A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] With these two extra relations in the proximity map, isl should be able to interchange the above loop. > > I can trick GRAPHITE into performing the interchange for > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 1; i < 1023; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i-1] + A[j][i+1]; > } > > because now there is a dependence. Any idea on how to rewrite > scop_get_dependences to avoid "simplifying"? I suppose the > validity constraints _do_ also specify kind-of a proximity > Correct: the validity map specifies a subset (it is missing RAR dependences) of data reuse. > we just may not prune / optimize them in the same way as > dependences? > Validity constraints are there to "keep the wind blowing in the same direction" after the transform (otherwise the result of the transformed computation may be wrong.) The proximity map should contain a description of - reuse of memory (temporal locality) - how close together the access elements are (spatial locality.) isl will optimize for both if the proximity map has a description of both. 
For the moment the proximity map is initialized only with the current validity constraints, as you quoted the FIXME comment, which would only describe a subset of the temporal locality. Sebastian
Re: [PATCH][GRAPHITE] More TLC
Hi skimo, On Tue, Sep 26, 2017 at 10:15 AM, Sven Verdoolaege <sven.verdoola...@gmail.com> wrote: > On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote: > > Sven, is there already a function that computes the sum of all > > strides in a proximity map? Maybe you have code that does > > something similar in pet or ppcg? > > What exactly do you want to sum? If this involves any counting, then it cannot currently > be done in pet or ppcg since isl does not support counting yet > and the public version of barvinok is GPL licensed. I think that it does involve counting: we need to know the distance between all pairs of array accesses, that is the number of points in the dependence polyhedron. > Also, it's better to ask such questions on the isl mailing list > isl-developm...@googlegroups.com We are trying to find a metric that shows that isl's scheduler did a useful transform. Something like a diff tool that shows, before and after scheduling, the strides of array accesses. Could the isl scheduler output a description of what it did? We would like to use that output to build testcases that match the behavior of the compiler on different patterns. Thanks, Sebastian
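One cheap stand-in for such a before/after metric, avoiding counting in isl entirely, is to replay the sequence of element offsets a given loop order touches and sum the absolute distances between consecutive accesses; a schedule that improved spatial locality should lower the total. This is a hypothetical sketch (not pet, ppcg, or barvinok code; the function names are invented), comparing the two loop orders of the A[j][i] example:

```c
#include <assert.h>
#include <stdlib.h>

#define N 16  /* same structure as the 1024x1024 example, just smaller */

/* Sum of |offset difference| between consecutive accesses to A[j][i]
   with j innermost: every inner step jumps a whole row (N elements). */
static long stride_sum_column_major (void)
{
  long sum = 0, prev = -1;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      {
        long off = (long) j * N + i;   /* element offset of A[j][i] */
        if (prev >= 0)
          sum += labs (off - prev);
        prev = off;
      }
  return sum;
}

/* Same accesses with i innermost: unit stride throughout, so the total
   is simply the number of transitions, N*N - 1. */
static long stride_sum_row_major (void)
{
  long sum = 0, prev = -1;
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      {
        long off = (long) j * N + i;
        if (prev >= 0)
          sum += labs (off - prev);
        prev = off;
      }
  return sum;
}
```

Diffing these two totals is exactly the kind of signal that could flag the interchange as profitable, without asking isl to count points.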
Re: [PATCH][GRAPHITE] More TLC
On Wed, 27 Sep 2017, Richard Biener wrote: > On Wed, 27 Sep 2017, Richard Biener wrote: > > > On Tue, 26 Sep 2017, Sebastian Pop wrote: > > > > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener wrote: > > > > > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener > > > > wrote: > > > > > > > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to > > > > > > trunk. > > > > > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > > > > > 61 loop nests "optimized" > > > > > > 45 loop nest transforms cancelled because of code generation issues > > > > > > 21 loop nest optimizations timed out the 35 ISL "operations" we > > > > allow > > > > > > > > > > > > I say "optimized" because the usual transform I've seen is static > > > > tiling > > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > > > There's no way to automagically figure what kind of transform ISL > > > > > > did > > > > > > > > > > > > > > > > Here is how to automate (without magic) the detection > > > > > of the transform that isl did. > > > > > > > > > > The problem solved by isl is the minimization of strides > > > > > in memory, and to do this, we need to tell the isl scheduler > > > > > the validity dependence graph, in graphite-optimize-isl.c > > > > > see the validity (RAW, WAR, WAW) and the proximity > > > > > (RAR + validity) maps. The proximity does include the > > > > > read after read, as the isl scheduler needs to minimize > > > > > strides between consecutive reads. 
> > Ah, so I now see why we do not perform interchange on trivial cases like > > double A[1024][1024], B[1024][1024]; > > void foo(void) > > { > > for (int i = 0; i < 1024; ++i) > > for (int j = 0; j < 1024; ++j) > > A[j][i] = B[j][i]; > > } > > > > which is probably because > > > > /* FIXME: proximity should not be validity. */ > > isl_union_map *proximity = isl_union_map_copy (validity); > > > > falls apart when there is _no_ dependence? > > > > I can trick GRAPHITE into performing the interchange for > > > > double A[1024][1024], B[1024][1024]; > > > > void foo(void) > > { > > for (int i = 1; i < 1023; ++i) > > for (int j = 0; j < 1024; ++j) > > A[j][i] = B[j][i-1] + A[j][i+1]; > > } > > > > because now there is a dependence. Any idea on how to rewrite > > scop_get_dependences to avoid "simplifying"? I suppose the > > validity constraints _do_ also specify kind-of a proximity > > we just may not prune / optimize them in the same way as > > dependences? > > Another thing I notice is that we don't handle the multi-dimensional > accesses the fortran frontend produces: > > (gdb) p debug_data_reference (dr) > #(Data Ref: > # bb: 18 > # stmt: _43 = *a_141(D)[_42]; > # ref: *a_141(D)[_42]; > # base_object: *a_141(D); > # Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, > stride.88_115}_5 > > ultimately we fail here because we try to build a constraint for > > {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5 > > which ends up computing isl_pw_aff_mul (A, stride.88_115) with > A being the non-constant constraint generated for > {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being > a parameter. ISL doesn't like that multiplication as the result > isn't affine (well - it is, we just have parameters in there). > > I suppose ISL doesn't handle this form of accesses given the > two "dimensions" in this scalarized form may overlap? 
> So we'd really need to turn those into references with different access functions (even if that's not 100% a valid semantic transformation as scalarization isn't reversible without extra information)? Looks like even when hacking the Fortran FE to produce nested ARRAY_REFs we run into the same issue for (gdb) p debug_data_reference (dr) #(Data Ref: # bb: 17 # stmt: VIEW_CONVERT_EXPR (*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8} = 0.0; # ref: VIEW_CONVERT_EXPR (*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8}; # base_object: VIEW_CONVERT_EXPR (*y_117(D)); # Access function 0: {1, +, 1}_4 # Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, (unsigned long) stride.88_92}_3; # Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +, (unsigned long) stride.90_96}_2; # Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +, (unsigned long) stride.92_100}_1; so it looks like simple strided (where stride is a parameter) access is not handled either.
Re: [PATCH][GRAPHITE] More TLC
On Wed, 27 Sep 2017, Richard Biener wrote: > On Tue, 26 Sep 2017, Sebastian Pop wrote: > > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener wrote: > > > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener > > > wrote: > > > > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > > > 61 loop nests "optimized" > > > > > 45 loop nest transforms cancelled because of code generation issues > > > > > 21 loop nest optimizations timed out the 35 ISL "operations" we > > > allow > > > > > > > > > > I say "optimized" because the usual transform I've seen is static > > > tiling > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > > > > > > > Here is how to automate (without magic) the detection > > > > of the transform that isl did. > > > > > > > > The problem solved by isl is the minimization of strides > > > > in memory, and to do this, we need to tell the isl scheduler > > > > the validity dependence graph, in graphite-optimize-isl.c > > > > see the validity (RAW, WAR, WAW) and the proximity > > > > (RAR + validity) maps. The proximity does include the > > > > read after read, as the isl scheduler needs to minimize > > > > strides between consecutive reads. 
> > Ah, so I now see why we do not perform interchange on trivial cases like > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 0; i < 1024; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i]; > } > > which is probably because > > /* FIXME: proximity should not be validity. */ > isl_union_map *proximity = isl_union_map_copy (validity); > > falls apart when there is _no_ dependence? > > I can trick GRAPHITE into performing the interchange for > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 1; i < 1023; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i-1] + A[j][i+1]; > } > > because now there is a dependence. Any idea on how to rewrite > scop_get_dependences to avoid "simplifying"? I suppose the > validity constraints _do_ also specify kind-of a proximity > we just may not prune / optimize them in the same way as > dependences? Another thing I notice is that we don't handle the multi-dimensional accesses the fortran frontend produces: (gdb) p debug_data_reference (dr) #(Data Ref: # bb: 18 # stmt: _43 = *a_141(D)[_42]; # ref: *a_141(D)[_42]; # base_object: *a_141(D); # Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5 ultimately we fail here because we try to build a constraint for {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5 which ends up computing isl_pw_aff_mul (A, stride.88_115) with A being the non-constant constraint generated for {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being a parameter. ISL doesn't like that multiplication as the result isn't affine (well - it is, we just have parameters in there). I suppose ISL doesn't handle this form of accesses given the two "dimensions" in this scalarized form may overlap? So we'd really need to turn those into references with different access functions (even if that's not 100% a valid semantic transformation as scalarization isn't reversible without extra information)? Thanks, Richard.
Re: [PATCH][GRAPHITE] More TLC
On Tue, 26 Sep 2017, Sebastian Pop wrote: > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener wrote: > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener > > wrote: > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > 61 loop nests "optimized" > > > > 45 loop nest transforms cancelled because of code generation issues > > > > 21 loop nest optimizations timed out the 35 ISL "operations" we > > allow > > > > > > > > I say "optimized" because the usual transform I've seen is static > > tiling > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > > > > Here is how to automate (without magic) the detection > > > of the transform that isl did. > > > > > > The problem solved by isl is the minimization of strides > > > in memory, and to do this, we need to tell the isl scheduler > > > the validity dependence graph, in graphite-optimize-isl.c > > > see the validity (RAW, WAR, WAW) and the proximity > > > (RAR + validity) maps. The proximity does include the > > > read after read, as the isl scheduler needs to minimize > > > strides between consecutive reads. Ah, so I now see why we do not perform interchange on trivial cases like double A[1024][1024], B[1024][1024]; void foo(void) { for (int i = 0; i < 1024; ++i) for (int j = 0; j < 1024; ++j) A[j][i] = B[j][i]; } which is probably because /* FIXME: proximity should not be validity. 
*/ isl_union_map *proximity = isl_union_map_copy (validity); falls apart when there is _no_ dependence? I can trick GRAPHITE into performing the interchange for double A[1024][1024], B[1024][1024]; void foo(void) { for (int i = 1; i < 1023; ++i) for (int j = 0; j < 1024; ++j) A[j][i] = B[j][i-1] + A[j][i+1]; } because now there is a dependence. Any idea on how to rewrite scop_get_dependences to avoid "simplifying"? I suppose the validity constraints _do_ also specify kind-of a proximity we just may not prune / optimize them in the same way as dependences? Richard.
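Sebastian's suggestion upthread — feeding the scheduler the accesses of two successive innermost iterations as extra proximity relations — might look roughly like this in isl notation (a sketch only; the statement name S and the exact bounds are illustrative, not what graphite emits):

```
# Desired pairs of array elements touched by successive iterations:
#   A[j, i] -> A[j, i + 1]   and   B[j, i] -> B[j, i + 1]
# Pulled back through the access relation S[i, j] -> A[j, i], the element
# A[j][i+1] is touched by instance S[i+1, j], so the extra proximity
# union map on statement instances would be:
{ S[i, j] -> S[i + 1, j] : 0 <= i < 1023 and 0 <= j < 1024 }
```

With such relations in the proximity map, the scheduler has an incentive to make i the innermost dimension even when the validity map is empty.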
Re: [PATCH][GRAPHITE] More TLC
On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote: > Sven, is there already a function that computes the sum of all > strides in a proximity map? Maybe you have code that does > something similar in pet or ppcg? What exactly do you want to sum? If this involves any counting, then it cannot currently be done in pet or ppcg since isl does not support counting yet and the public version of barvinok is GPL licensed. Also, it's better to ask such questions on the isl mailing list isl-developm...@googlegroups.com skimo
Re: [PATCH][GRAPHITE] More TLC
On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener wrote: > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener > wrote: > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > when reworking canonicalize_loop_closed_ssa. > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > 61 loop nests "optimized" > > > 45 loop nest transforms cancelled because of code generation issues > > > 21 loop nest optimizations timed out the 35 ISL "operations" we > allow > > > > > > I say "optimized" because the usual transform I've seen is static > tiling > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > Here is how to automate (without magic) the detection > > of the transform that isl did. > > > > The problem solved by isl is the minimization of strides > > in memory, and to do this, we need to tell the isl scheduler > > the validity dependence graph, in graphite-optimize-isl.c > > see the validity (RAW, WAR, WAW) and the proximity > > (RAR + validity) maps. The proximity does include the > > read after read, as the isl scheduler needs to minimize > > strides between consecutive reads. > > > > When you apply the schedule to the dependence graph, > > one can tell from the result the strides in memory, a good > > way to say whether a transform was beneficial is to sum up > > all memory strides, and make sure that the sum of all strides > > decreases after transform. We could add a printf with the > > sum of strides before and after transforms, and have the > > testcases check for that. > > Interesting. Can you perhaps show me in code how to do that? 
> > Sven, is there already a function that computes the sum of all strides in a proximity map? Maybe you have code that does something similar in pet or ppcg? Thanks, Sebastian
Re: [PATCH][GRAPHITE] More TLC
On Fri, 22 Sep 2017, Sebastian Pop wrote: > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener wrote: > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > when reworking canonicalize_loop_closed_ssa. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > -Ofast -march=haswell -floop-nest-optimize are > > > > 61 loop nests "optimized" > > 45 loop nest transforms cancelled because of code generation issues > > 21 loop nest optimizations timed out the 35 ISL "operations" we allow > > > > I say "optimized" because the usual transform I've seen is static tiling > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > There's no way to automagically figure what kind of transform ISL did > > > > Here is how to automate (without magic) the detection > of the transform that isl did. > > The problem solved by isl is the minimization of strides > in memory, and to do this, we need to tell the isl scheduler > the validity dependence graph, in graphite-optimize-isl.c > see the validity (RAW, WAR, WAW) and the proximity > (RAR + validity) maps. The proximity does include the > read after read, as the isl scheduler needs to minimize > strides between consecutive reads. > > When you apply the schedule to the dependence graph, > one can tell from the result the strides in memory, a good > way to say whether a transform was beneficial is to sum up > all memory strides, and make sure that the sum of all strides > decreases after transform. We could add a printf with the > sum of strides before and after transforms, and have the > testcases check for that. Interesting. Can you perhaps show me in code how to do that? Thanks, Richard.
Re: [PATCH][GRAPHITE] More TLC
On Mon, 25 Sep 2017, Bin.Cheng wrote: > On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener wrote: > > On Mon, 25 Sep 2017, Richard Biener wrote: > > > >> On Fri, 22 Sep 2017, Richard Biener wrote: > >> > >> > > >> > This simplifies canonicalize_loop_closed_ssa and does other minimal > >> > TLC. It also adds a testcase I reduced from a stupid mistake I made > >> > when reworking canonicalize_loop_closed_ssa. > >> > > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > >> > > >> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > >> > -Ofast -march=haswell -floop-nest-optimize are > >> > > >> > 61 loop nests "optimized" > >> > 45 loop nest transforms cancelled because of code generation issues > >> > 21 loop nest optimizations timed out the 35 ISL "operations" we > >> > allow > >> > >> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize > >> and 709 sec. with (this was with release checking). > >> > >> A single-run has 416.gamess (580s -> 618s), > >> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), > >> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> > >> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will > >> do a 3-run for those to confirm (it would be only a single regression > >> for 416.gamess). > > > > 416.gamess regression confirmed, 450.soplex improvement as well, > 436/437 improvements? 450.soplex (229s -> 226s) looks like noise. base is with -floop-nest-optimize, peak without. 
(columns per side: ref time, run time, ratio; * marks the median run)

                      base                      peak
416.gamess        19580   619  31.7 S     19580   576  34.0 *
416.gamess        19580   614  31.9 S     19580   577  33.9 S
416.gamess        19580   618  31.7 *     19580   576  34.0 S
436.cactusADM     11950   194  61.5 S     11950   204  58.5 S
436.cactusADM     11950   184  65.0 S     11950   187  63.8 *
436.cactusADM     11950   186  64.1 *     11950   186  64.1 S
437.leslie3d       9400   219  43.0 S      9400   218  43.1 S
437.leslie3d       9400   219  43.0 *      9400   223  42.1 S
437.leslie3d       9400   218  43.0 S      9400   223  42.2 *
450.soplex         8340   225  37.0 S      8340   231  36.1 S
450.soplex         8340   226  36.9 *      8340   230  36.3 *
450.soplex         8340   227  36.8 S      8340   229  36.4 S
465.tonto          9840   426  23.1 S      9840   427  23.0 *
465.tonto          9840   424  23.2 S      9840   430  22.9 S
465.tonto          9840   425  23.2 *      9840   425  23.2 S
401.bzip2          9650   379  25.5 S      9650   378  25.5 S
401.bzip2          9650   379  25.5 *      9650   380  25.4 *
401.bzip2          9650   379  25.5 S      9650   380  25.4 S
462.libquantum    20720   351  59.0 *     20720   349  59.4 S
462.libquantum    20720   351  59.0 S     20720   345  60.1 *
462.libquantum    20720   352  58.8 S     20720   344  60.2 S

> Thanks, > bin > > in the three-run 462.libquantum regresses (344s -> 351s) so I suppose > that's noise. > > Richard. > > -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
Re: [PATCH][GRAPHITE] More TLC
On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener wrote: > On Mon, 25 Sep 2017, Richard Biener wrote: > >> On Fri, 22 Sep 2017, Richard Biener wrote: >> >> > >> > This simplifies canonicalize_loop_closed_ssa and does other minimal >> > TLC. It also adds a testcase I reduced from a stupid mistake I made >> > when reworking canonicalize_loop_closed_ssa. >> > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. >> > >> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with >> > -Ofast -march=haswell -floop-nest-optimize are >> > >> > 61 loop nests "optimized" >> > 45 loop nest transforms cancelled because of code generation issues >> > 21 loop nest optimizations timed out the 35 ISL "operations" we allow >> >> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize >> and 709 sec. with (this was with release checking). >> >> A single-run has 416.gamess (580s -> 618s), >> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), >> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> >> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will >> do a 3-run for those to confirm (it would be only a single regression >> for 416.gamess). > > 416.gamess regression confirmed, 450.soplex improvement as well, 436/437 improvements? 450.soplex (229s -> 226s) looks like noise. Thanks, bin > in the three-run 462.libquantum regresses (344s -> 351s) so I suppose > that's noise. > > Richard.
Re: [PATCH][GRAPHITE] More TLC
On Mon, 25 Sep 2017, Richard Biener wrote: > On Fri, 22 Sep 2017, Richard Biener wrote: > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > when reworking canonicalize_loop_closed_ssa. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > -Ofast -march=haswell -floop-nest-optimize are > > > > 61 loop nests "optimized" > > 45 loop nest transforms cancelled because of code generation issues > > 21 loop nest optimizations timed out the 35 ISL "operations" we allow > > Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize > and 709 sec. with (this was with release checking). > > A single-run has 416.gamess (580s -> 618s), > 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), > 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> > 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will > do a 3-run for those to confirm (it would be only a single regression > for 416.gamess). 416.gamess regression confirmed, 450.soplex improvement as well, in the three-run 462.libquantum regresses (344s -> 351s) so I suppose that's noise. Richard.
Re: [PATCH][GRAPHITE] More TLC
On Fri, 22 Sep 2017, Richard Biener wrote: > > This simplifies canonicalize_loop_closed_ssa and does other minimal > TLC. It also adds a testcase I reduced from a stupid mistake I made > when reworking canonicalize_loop_closed_ssa. > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > -Ofast -march=haswell -floop-nest-optimize are > > 61 loop nests "optimized" > 45 loop nest transforms cancelled because of code generation issues > 21 loop nest optimizations timed out the 35 ISL "operations" we allow Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize and 709 sec. with (this was with release checking). A single-run has 416.gamess (580s -> 618s), 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will do a 3-run for those to confirm (it would be only a single regression for 416.gamess). So far I'm positively surprised given the limitations (and inefficiencies) I know. I'll add some more opt-info stuff to assess the number of SCOPs we detect but discard during further analysis and the number of transforms we cancel because they turn out as a no-op. Richard. 
> { > - if (dump_file && dump_flags) > - fprintf (dump_file, "isl timed out --param max-isl-operations=%d\n", > - max_operations); > + location_t loc = find_loop_location > + (scop->scop_info->region.entry->dest->loop_father); > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, loc, > +"loop nest not optimized, optimization timed out " > +"after %d operations [--param max-isl-operations]\n", > +max_operations); >return false; > } > > Index: gcc/graphite.c > === > --- gcc/graphite.c(revision 253091) > +++ gcc/graphite.c(working copy) > @@ -293,86 +293,6 @@ free_scops (vec scops) >scops.release (); > } > > -/* Returns true when P1 and P2 are close phis with the same > - argument. */ > - > -static inline bool > -same_close_phi_node (gphi *p1, gphi *p2) > -{ > - return (types_compatible_p (TREE_TYPE (gimple_phi_result (p1)), > - TREE_TYPE (gimple_phi_result (p2))) > - && operand_equal_p (gimple_phi_arg_def (p1, 0), > - gimple_phi_arg_def (p2, 0), 0)); > -} > - > -static void make_close_phi_nodes_unique (basic_block bb); > - > -/* Remove the close phi node at GSI and replace its rhs with the rhs > - of PHI. */ > - > -static void > -remove_duplicate_close_phi (gphi *phi, gphi_iterator *gsi) > -{ > - gimple *use_stmt; > - use_operand_p use_p; > - imm_use_iterator imm_iter; > - tree res = gimple_phi_result (phi); > - tree def = gimple_phi_result (gsi->phi ()); > - > - gcc_assert (same_close_phi_node (phi, gsi->phi ())); > - > - FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def) > -{ > - FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter) > - SET_USE (use_p, res); > - > - update_stmt (use_stmt); > - > - /* It is possible that we just created a duplicate close-phi > - for an already-processed containing loop. Check for this > - case and clean it up. 
*/ > - if (gimple_code (use_stmt) == GIMPLE_PHI > - && gimple_phi_num_args (use_stmt) == 1) > - make_close_phi_nodes_unique (gimple_bb (use_stmt)); > -} > - > - remove_phi_node (gsi, true); > -} > - > -/* Removes all the close phi duplicates from BB. */ > - > -static void > -make_close_phi_nodes_unique (basic_block bb) > -{ > - gphi_iterator psi; > - > - for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi)) > -{ > - gphi_iterator gsi = psi; > - gphi *phi = psi.phi (); > - > - /* At this point, PHI should be a close phi in normal form. */ > - gcc_assert (gimple_phi_num_args (phi) == 1); > - > - /* Iterate over the next phis and remove duplicates. */ > - gsi_next (&gsi); > - while (!gsi_end_p (gsi)) > - if (same_close_phi_node (phi, gsi.phi ())) > - remove_duplicate_close_phi (phi, &gsi); > - else > - gsi_next (&gsi); > -} > -} > - > -/* Return true when NAME is defined in LOOP. */ > - > -static bool > -defined_in_loop_p (tree name, loop_p loop) > -{ > - gcc_assert (TREE_CODE (name) == SSA_NAME); > - return loop == loop_containing_stmt (SSA_NAME_DEF_STMT (name)); > -} > - > /* Transforms LOOP to the canonical loop closed SSA form. */ > > static void > @@ -380,20 +300,22 @@ canonicalize_loop_closed_ssa (loop_p loo > { >edge e = single_exit (loop); >basic_block bb; > + gphi_iterator psi; > >if (!e || (e->flags & EDGE_COMPLEX)) > return; > >bb = e->dest; > > + /* Make the loop-close PHI node BB contain only PHIs and have a > + single predecessor. */ >
Re: [PATCH][GRAPHITE] More TLC
On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener wrote: > > This simplifies canonicalize_loop_closed_ssa and does other minimal > TLC. It also adds a testcase I reduced from a stupid mistake I made > when reworking canonicalize_loop_closed_ssa. > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > -Ofast -march=haswell -floop-nest-optimize are > > 61 loop nests "optimized" > 45 loop nest transforms cancelled because of code generation issues > 21 loop nest optimizations timed out the 35 ISL "operations" we allow > > I say "optimized" because the usual transform I've seen is static tiling > as enforced by GRAPHITE according to --param loop-block-tile-size. > There's no way to automagically figure what kind of transform ISL did > Here is how to automate (without magic) the detection of the transform that isl did. The problem solved by isl is the minimization of strides in memory, and to do this, we need to tell the isl scheduler the validity dependence graph, in graphite-optimize-isl.c see the validity (RAW, WAR, WAW) and the proximity (RAR + validity) maps. The proximity does include the read after read, as the isl scheduler needs to minimize strides between consecutive reads. When you apply the schedule to the dependence graph, one can tell from the result the strides in memory, a good way to say whether a transform was beneficial is to sum up all memory strides, and make sure that the sum of all strides decreases after transform. We could add a printf with the sum of strides before and after transforms, and have the testcases check for that. (usually none with the schedule identical check confused by FILTER > stuff positioning). This is also the issue with most GRAPHITE testcases. > We can't really verify easily whether we performed loop interchange > or not. We can probably tell whether we applied loop fusion or > splitting (by counting loops). 
> > I'm not aware of any remaining ICEs / wrong-code issues with GRAPHITE. > > I'm aware that the current "black-box" granularity hinders > scheduling freedom (each GIMPLE BB is mapped to a ISL stmt, this > is too coarse to schedule say two writes in a BB independently > from each other). Quick experiments could be done by simply > splitting gimple BBs at some points. > > I'm aware that the SCOP detection algorithm assumes that it can > walk loop->next and find loops "in order" -- but while that's > true for the initial flow_loops_find result (DFS walk) it isn't > true for any later created / discovered loops. Sorting of > loop siblings in DFS order should be easy (and a general cfgloopanal > helper). > > Richard. > > 2017-09-22 Richard Biener > > * graphite-isl-ast-to-gimple.c (graphite_verify): Inline into > single caller. > (graphite_regenerate_ast_isl): Do not reset SCEV. Move debug > print of no dependency loops ... > * graphite.c (graphite_transform_loops): ... here. > (canonicalize_loop_closed_ssa_form): Work from inner to outer > loops. > (same_close_phi_node, remove_duplicate_close_phi, > make_close_phi_nodes_unique, defined_in_loop_p): Fold into ... > (canonicalize_loop_closed_ssa): ... here and simplify. > * graphite-optimize-isl.c: Include tree-vectorizer.h. > (optimize_isl): Use dump_printf_loc to tell when we stopped > optimizing because of an ISL timeout. > > * gcc.dg/graphite/scop-24.c: New testcase. > > The change looks good to me. Thanks, Sebastian