Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)

2017-11-11 Thread Sven Verdoolaege
On Sun, Oct 01, 2017 at 11:58:30AM +0200, Sven Verdoolaege wrote:
> For the approach pluto is taking, you'll have to look at the source
> code, see pluto_intra_tile_optimize_band.
> For the other two approaches I mentioned above, reports will
> be made available within the next couple of weeks.

https://hal.inria.fr/hal-01628798
http://www.cs.kuleuven.be/publicaties/rapporten/cw/CW709.abs.html

skimo


Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)

2017-10-01 Thread Sven Verdoolaege
On Sat, Sep 30, 2017 at 07:47:43PM +0200, Richard Biener wrote:
> On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop  
> wrote:
> >On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
> > wrote:
> >> [Sorry for the resend; I used the wrong email address to CC Alex]
> >>
> >> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
> >>> Ah, so I now see why we do not perform interchange on trivial cases
> >like
> >>>
> >>> double A[1024][1024], B[1024][1024];
> >>>
> >>> void foo(void)
> >>> {
> >>>   for (int i = 0; i < 1024; ++i)
> >>> for (int j = 0; j < 1024; ++j)
> >>>   A[j][i] = B[j][i];
> >>> }
> >>
> >> I didn't see you mentioning _why_ you expect an interchange here.
> >> Are you perhaps interested in spatial locality?
> >> If so, then there are several approaches for taking
> >> that into account.
> >> - pluto performs an intra-tile loop interchange to
> >>   improve temporal and/or spatial locality.  It shouldn't
> >>   be too hard to do something similar on an isl generated
> >>   schedule
> >> - Alex (Oleksandr) has been working on an extension of
> >>   the isl scheduler that takes into account spatial locality.
> >>   I'm not sure if it's publicly available.
> >> - I've been working on a special case of spatial locality
> >>   (consecutivity).  The current version is available in
> >>   the consecutivity branch.  Note that it may get rebased and
> >>   it may not necessarily get merged into master.
> >>
> >> There are also other approaches, but they may not be that
> >> easy to combine with the isl scheduler.
> >
> >Would the following work?
> >Add to the proximity relation the array accesses from two
> >successive iterations of the innermost loop:
> >A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
> >With these two extra relations in the proximity map,
> >isl should be able to interchange the above loop.
> 
> Can anyone give a hint on how to do that in ISL terms? 

For the approach pluto is taking, you'll have to look at the source
code, see pluto_intra_tile_optimize_band.
For the other two approaches I mentioned above, reports will
be made available within the next couple of weeks.
For the last one, there is a sample implementation in the
consecutivity branch of PPCG.

skimo


Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)

2017-09-30 Thread Richard Biener
On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop  
wrote:
>On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
> wrote:
>> [Sorry for the resend; I used the wrong email address to CC Alex]
>>
>> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
>>> Ah, so I now see why we do not perform interchange on trivial cases
>like
>>>
>>> double A[1024][1024], B[1024][1024];
>>>
>>> void foo(void)
>>> {
>>>   for (int i = 0; i < 1024; ++i)
>>> for (int j = 0; j < 1024; ++j)
>>>   A[j][i] = B[j][i];
>>> }
>>
>> I didn't see you mentioning _why_ you expect an interchange here.
> >> Are you perhaps interested in spatial locality?
>> If so, then there are several approaches for taking
>> that into account.
>> - pluto performs an intra-tile loop interchange to
>>   improve temporal and/or spatial locality.  It shouldn't
>>   be too hard to do something similar on an isl generated
>>   schedule
>> - Alex (Oleksandr) has been working on an extension of
>>   the isl scheduler that takes into account spatial locality.
>>   I'm not sure if it's publicly available.
>> - I've been working on a special case of spatial locality
>>   (consecutivity).  The current version is available in
>>   the consecutivity branch.  Note that it may get rebased and
>>   it may not necessarily get merged into master.
>>
>> There are also other approaches, but they may not be that
>> easy to combine with the isl scheduler.
>
>Would the following work?
>Add to the proximity relation the array accesses from two
>successive iterations of the innermost loop:
>A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
>With these two extra relations in the proximity map,
>isl should be able to interchange the above loop.

Can anyone give a hint on how to do that in ISL terms? 

Richard. 

>Sebastian



Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)

2017-09-29 Thread Oleksandr Zinenko



On 29/09/17 21:58, Sebastian Pop wrote:

On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
 wrote:

[Sorry for the resend; I used the wrong email address to CC Alex]

On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:

Ah, so I now see why we do not perform interchange on trivial cases like

double A[1024][1024], B[1024][1024];

void foo(void)
{
  for (int i = 0; i < 1024; ++i)
    for (int j = 0; j < 1024; ++j)
      A[j][i] = B[j][i];
}

I didn't see you mentioning _why_ you expect an interchange here.
Are you perhaps interested in spatial locality?
If so, then there are several approaches for taking
that into account.
- pluto performs an intra-tile loop interchange to
   improve temporal and/or spatial locality.  It shouldn't
   be too hard to do something similar on an isl generated
   schedule
- Alex (Oleksandr) has been working on an extension of
   the isl scheduler that takes into account spatial locality.
   I'm not sure if it's publicly available.
- I've been working on a special case of spatial locality
   (consecutivity).  The current version is available in
   the consecutivity branch.  Note that it may get rebased and
   it may not necessarily get merged into master.

There are also other approaches, but they may not be that
easy to combine with the isl scheduler.

Would the following work?
Add to the proximity relation the array accesses from two
successive iterations of the innermost loop:
A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
With these two extra relations in the proximity map,
isl should be able to interchange the above loop.

Sebastian

Hi,

this looks very close to what we do for spatial locality in the 
scheduler, except that we separate proximity and "spatial proximity" 
maps.  There are a couple of caveats in just plugging those into 
proximity, in particular resolving conflicts between spatial and 
temporal locality and avoiding unnecessary skewing.


Cheers,
Alex

--
Oleksandr Zinenko,
Inria / École Normale Supérieure,
cont...@ozinenko.com, oleksandr.zine...@inria.fr
https://www.ozinenko.com



Re: isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)

2017-09-29 Thread Sebastian Pop
On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
 wrote:
> [Sorry for the resend; I used the wrong email address to CC Alex]
>
> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
>> Ah, so I now see why we do not perform interchange on trivial cases like
>>
>> double A[1024][1024], B[1024][1024];
>>
>> void foo(void)
>> {
>>   for (int i = 0; i < 1024; ++i)
>> for (int j = 0; j < 1024; ++j)
>>   A[j][i] = B[j][i];
>> }
>
> I didn't see you mentioning _why_ you expect an interchange here.
> Are you perhaps interested in spatial locality?
> If so, then there are several approaches for taking
> that into account.
> - pluto performs an intra-tile loop interchange to
>   improve temporal and/or spatial locality.  It shouldn't
>   be too hard to do something similar on an isl generated
>   schedule
> - Alex (Oleksandr) has been working on an extension of
>   the isl scheduler that takes into account spatial locality.
>   I'm not sure if it's publicly available.
> - I've been working on a special case of spatial locality
>   (consecutivity).  The current version is available in
>   the consecutivity branch.  Note that it may get rebased and
>   it may not necessarily get merged into master.
>
> There are also other approaches, but they may not be that
> easy to combine with the isl scheduler.

Would the following work?
Add to the proximity relation the array accesses from two
successive iterations of the innermost loop:
A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
With these two extra relations in the proximity map,
isl should be able to interchange the above loop.

Sebastian


isl scheduler and spatial locality (Re: [PATCH][GRAPHITE] More TLC)

2017-09-29 Thread Sven Verdoolaege
[Sorry for the resend; I used the wrong email address to CC Alex]

On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
> Ah, so I now see why we do not perform interchange on trivial cases like
> 
> double A[1024][1024], B[1024][1024];
> 
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
> for (int j = 0; j < 1024; ++j)
>   A[j][i] = B[j][i];
> }

I didn't see you mentioning _why_ you expect an interchange here.
> Are you perhaps interested in spatial locality?
If so, then there are several approaches for taking
that into account.
- pluto performs an intra-tile loop interchange to
  improve temporal and/or spatial locality.  It shouldn't
  be too hard to do something similar on an isl generated
  schedule
- Alex (Oleksandr) has been working on an extension of
  the isl scheduler that takes into account spatial locality.
  I'm not sure if it's publicly available.
- I've been working on a special case of spatial locality
  (consecutivity).  The current version is available in
  the consecutivity branch.  Note that it may get rebased and
  it may not necessarily get merged into master.

There are also other approaches, but they may not be that
easy to combine with the isl scheduler.

skimo


Re: [PATCH][GRAPHITE] More TLC

2017-09-29 Thread Sebastian Pop
On Fri, Sep 29, 2017 at 6:17 AM, Richard Biener  wrote:
> I fixed the "hack patch" somewhat but realized it's not really possible
> ATM to get back at this form because the array descriptor contains only
> information to generate the linearized form.  So while I get now correct
> values they end up with runtime divisions which aren't handled by
> SCEV.

You are right, SCEV has some limits on representing and folding
those division expressions.

There is a proposal in LLVM from Johannes Doerfert
https://reviews.llvm.org/D38255
to use isl as the representation and expression folder for the
scalar evolution analysis, instead of the chains of recurrences.  isl
would be able to handle some of the semantics of the div_exprs and
the semantics of wrap-around variables.  Of course it would have
its own limits in representing multiplications (as we spoke about
yesterday, i * N or M * N for example), and thus such a polyhedral
analysis would still need to rely on the delinearization of array indices.


Re: [PATCH][GRAPHITE] More TLC

2017-09-29 Thread Richard Biener
On Thu, 28 Sep 2017, Sebastian Pop wrote:

> On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener  wrote:
> > Looks like even when hacking the Fortran FE to produce nested
> > ARRAY_REFs we run into the same issue for
> >
> > (gdb) p debug_data_reference (dr)
> > #(Data Ref:
> > #  bb: 17
> > #  stmt:
> > VIEW_CONVERT_EXPR(*y_117(D))[_24]{lb:
> > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> > sz: 8} = 0.0;
> > #  ref:
> > VIEW_CONVERT_EXPR(*y_117(D))[_24]{lb:
> > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> > sz: 8};
> > #  base_object:
> > VIEW_CONVERT_EXPR(*y_117(D));
> > #  Access function 0: {1, +, 1}_4
> > #  Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +,
> > (unsigned long) stride.88_92}_3;
> > #  Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +,
> > (unsigned long) stride.90_96}_2;
> > #  Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +,
> > (unsigned long) stride.92_100}_1;
> >
> > so it looks like simple strided (where stride is a parameter) access
> > is not handled either.
> 
> Yes, this is the first option I was mentioning: it could work,
> could you please make sure that you don't have a bug in the "hack patch"
> where the outer dimension should not contain the parameter
> (inner array dimension) times the access function.
> 
> Example in C:
> int A[100][N];
> A[i][j] is linearized as *(A + i * N * 4 + j * 4)
> and you may have a bug if you delinearized it in the Fortran FE as A[i * N][j]
> Could you please check that it would delinearize back to A[i][j]?

I fixed the "hack patch" somewhat but realized it's not really possible
ATM to get back at this form because the array descriptor contains only
information to generate the linearized form.  So while I get now correct
values they end up with runtime divisions which aren't handled by
SCEV.

I fear emitting delinearized code from the Fortran FE would be an
ABI breakage as we'd have to change the array descriptor contents.

> >
> > GCCs dependence analysis can at least compute distances of two
> > DRs when the difference of the access CHRECs is constant.  Within
> > the polyhedral model those cases cannot be handled?
> 
> The difficulty for the polyhedral model is in the representation
> of a multiplication of parameter times loop index variable.
> The delinearization removes these difficulties by creating linear expressions.
> 
> Think about multiplication as something introducing exponentiality
> and you realize that any such expression would not fit in the
> linear model of polyhedra.
> A parameter is nothing else than an outer loop index to which we don't
> have access to that loop level as it may be outside the current function
> in which we get that parameter in.

Yeah, I see that now.

Richard.


Re: [PATCH][GRAPHITE] More TLC

2017-09-28 Thread Sebastian Pop
On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener  wrote:
> Looks like even when hacking the Fortran FE to produce nested
> ARRAY_REFs we run into the same issue for
>
> (gdb) p debug_data_reference (dr)
> #(Data Ref:
> #  bb: 17
> #  stmt:
> VIEW_CONVERT_EXPR(*y_117(D))[_24]{lb:
> 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> sz: 8} = 0.0;
> #  ref:
> VIEW_CONVERT_EXPR(*y_117(D))[_24]{lb:
> 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> sz: 8};
> #  base_object:
> VIEW_CONVERT_EXPR(*y_117(D));
> #  Access function 0: {1, +, 1}_4
> #  Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +,
> (unsigned long) stride.88_92}_3;
> #  Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +,
> (unsigned long) stride.90_96}_2;
> #  Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +,
> (unsigned long) stride.92_100}_1;
>
> so it looks like simple strided (where stride is a parameter) access
> is not handled either.

Yes, this is the first option I was mentioning: it could work.
Could you please make sure that you don't have a bug in the "hack patch":
the outer dimension should not contain the parameter
(the inner array dimension) times the access function.

Example in C:
int A[100][N];
A[i][j] is linearized as *(A + i * N * 4 + j * 4),
and you may have a bug if the Fortran FE delinearized it as A[i * N][j].
Could you please check that it would delinearize back to A[i][j]?

>
> GCCs dependence analysis can at least compute distances of two
> DRs when the difference of the access CHRECs is constant.  Within
> the polyhedral model those cases cannot be handled?

The difficulty for the polyhedral model is in the representation
of a multiplication of parameter times loop index variable.
The delinearization removes these difficulties by creating linear expressions.

Think of multiplication as something introducing non-linearity
and you realize that any such expression would not fit in the
linear model of polyhedra.
A parameter is nothing else than an outer loop index whose loop
level we don't have access to, as it may be outside the current
function in which we receive that parameter.
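
Restated in formulas (my own summary of the standard affinity
condition, not text from the thread): an affine access function must
give the iterators constant, parameter-free coefficients.

```latex
% Affine access functions over iterators i_k and parameters N_p:
f(i_1, \dots, i_d) = c_0 + \sum_k c_k\, i_k + \sum_p d_p\, N_p ,
\qquad c_k,\, d_p \in \mathbb{Z} \ \text{constant.}
% The linearized subscript of A[i][j] with inner dimension N,
f(i, j) = i \cdot N + j ,
% gives i the non-constant coefficient N; treating N as one more
% (hidden outer-loop) variable makes f quadratic, hence non-affine.
```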

Sebastian


Re: [PATCH][GRAPHITE] More TLC

2017-09-28 Thread Sebastian Pop
On Wed, Sep 27, 2017 at 8:04 AM, Richard Biener  wrote:
>
> Another thing I notice is that we don't handle the multi-dimensional
> accesses the fortran frontend produces:
>
> (gdb) p debug_data_reference (dr)
> #(Data Ref:
> #  bb: 18
> #  stmt: _43 = *a_141(D)[_42];
> #  ref: *a_141(D)[_42];
> #  base_object: *a_141(D);
> #  Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +,
> stride.88_115}_5
>
> ultimately we fail here because we try to build a constraint for
>
> {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5
>
> which ends up computing isl_pw_aff_mul (A, stride.88_115) with
> A being the non-constant constraint generated for
> {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being
> a parameter.  ISL doesn't like that multiplication as the result
> isn't affine (well - it is, we just have parameters in there).
>
> I suppose ISL doesn't handle this form of accesses given the
> two "dimensions" in this scalarized form may overlap?  So we'd
> really need to turn those into references with different access
> functions (even if that's not 100% a valid semantic transformation
> as scalarization isn't reversible without extra information)?

You are right.
This multivariate memory access would be better handled in
delinearized form:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66981
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61000
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

There are two ways to handle this issue:
- fix the Fortran front-end to emit multi-dimensional ARRAY_REFs,
- implement an array delinearization pass, as I implemented in LLVM:
http://llvm.org/doxygen/Delinearization_8cpp_source.html
"On Recovering Multi-Dimensional Arrays in Polly"
http://impact.gforge.inria.fr/impact2015/papers/impact2015-grosser.pdf
"Optimistic Delinearization of Parametrically Sized Arrays"
https://dl.acm.org/citation.cfm?id=2751248
LLVM does not have an equivalent of the multi-dim ARRAY_REF description:
it only reasons about linearized memory accesses, like in GCC's RTL
(gep = Get Element Pointer), so we had no other option than to delinearize.

Sebastian


Re: [PATCH][GRAPHITE] More TLC

2017-09-28 Thread Sebastian Pop
On Wed, Sep 27, 2017 at 7:18 AM, Richard Biener  wrote:

> On Tue, 26 Sep 2017, Sebastian Pop wrote:
>
> > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener 
> wrote:
> >
> > > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> > >
> > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener 
> > > wrote:
> > > >
> > > > >
> > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > > TLC.  It also adds a testcase I reduced from a stupid mistake I
> made
> > > > > when reworking canonicalize_loop_closed_ssa.
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to
> trunk.
> > > > >
> > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > > -Ofast -march=haswell -floop-nest-optimize are
> > > > >
> > > > >  61 loop nests "optimized"
> > > > >  45 loop nest transforms cancelled because of code generation
> issues
> > > > >  21 loop nest optimizations timed out at the 35 ISL "operations" we allow
> > > > >
> > > > > I say "optimized" because the usual transform I've seen is static
> > > tiling
> > > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > > There's no way to automagically figure what kind of transform ISL
> did
> > > > >
> > > >
> > > > Here is how to automate (without magic) the detection
> > > > of the transform that isl did.
> > > >
> > > > The problem solved by isl is the minimization of strides
> > > > in memory, and to do this, we need to tell the isl scheduler
> > > > the validity dependence graph, in graphite-optimize-isl.c
> > > > see the validity (RAW, WAR, WAW) and the proximity
> > > > (RAR + validity) maps.  The proximity does include the
> > > > read after read, as the isl scheduler needs to minimize
> > > > strides between consecutive reads.
>
> Ah, so I now see why we do not perform interchange on trivial cases like
>
> double A[1024][1024], B[1024][1024];
>
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
> for (int j = 0; j < 1024; ++j)
>   A[j][i] = B[j][i];
> }
>
> which is probably because
>
>   /* FIXME: proximity should not be validity.  */
>   isl_union_map *proximity = isl_union_map_copy (validity);
>
> falls apart when there is _no_ dependence?
>

You are right.  The proximity needs to account for spatial
locality as well if you want to interchange the loop.
To describe the spatial locality, I would recommend adding
to the proximity relation the array accesses from two
successive iterations of the innermost loop:
A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
With these two extra relations in the proximity map,
isl should be able to interchange the above loop.
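
To make the suggestion concrete, here is a hand-written sketch in isl's
textual notation (the statement name S and the instance-wise form of
the relations are my own illustration, not code from GRAPHITE).  The
accesses A[j][i] and A[j][i+1] are performed by the statement instances
S[i, j] and S[i + 1, j], so the extra proximity relations pair those
instances:

```
# iteration domain of S: A[j][i] = B[j][i]
{ S[i, j] : 0 <= i <= 1023 and 0 <= j <= 1023 }

# spatial proximity: S[i, j] touches A[j][i] and B[j][i];
# S[i + 1, j] touches the adjacent A[j][i + 1] and B[j][i + 1]
{ S[i, j] -> S[i + 1, j] : 0 <= i <= 1022 and 0 <= j <= 1023 }
```

Minimizing the schedule distance between S[i, j] and S[i + 1, j] pushes
the i dimension innermost, which is exactly the interchange.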


>
> I can trick GRAPHITE into performing the interchange for
>
> double A[1024][1024], B[1024][1024];
>
> void foo(void)
> {
>   for (int i = 1; i < 1023; ++i)
> for (int j = 0; j < 1024; ++j)
>   A[j][i] = B[j][i-1] + A[j][i+1];
> }
>
> because now there is a dependence.  Any idea on how to rewrite
> scop_get_dependences to avoid "simplifying"?  I suppose the
> validity constraints _do_ also specify kind-of a proximity
>

Correct: the validity map describes only a subset of the data
reuse (it is missing the RAR dependences).


> we just may not prune / optimize them in the same way as
> dependences?
>

Validity constraints are there to "keep the wind blowing
in the same direction" after the transform (otherwise the
result of the transformed computation may be wrong).

The proximity map should contain a description of:
- reuse of memory (temporal locality),
- how close together the accessed elements are (spatial locality).
isl will optimize for both if the proximity map has a description
of both.

For the moment the proximity map is initialized only with the
current validity constraints, as you quoted the FIXME comment,
which would only describe a subset of the temporal locality.

Sebastian


Re: [PATCH][GRAPHITE] More TLC

2017-09-28 Thread Sebastian Pop
Hi skimo,

On Tue, Sep 26, 2017 at 10:15 AM, Sven Verdoolaege <
sven.verdoola...@gmail.com> wrote:

> On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote:
> > Sven, is there already a function that computes the sum of all
> > strides in a proximity map?  Maybe you have code that does
> > something similar in pet or ppcg?
>
> What exactly do you want to sum?
> If this involves any counting, then it cannot currently
> be done in pet or ppcg since isl does not support counting yet
> and the public version of barvinok is GPL licensed.

I think that it does involve counting: we need to know
the distance between all pairs of array accesses, that is the
number of points in the dependence polyhedron.

> Also, it's better to ask such questions on the isl mailing list
> isl-developm...@googlegroups.com
We are trying to find a metric that shows that isl's scheduler
did a useful transform.  Something like a diff tool that shows
before and after scheduling the strides of array accesses.

Could the isl scheduler output a description of what it did?
We would like to use that output to build testcases that match
the behavior of the compiler on different patterns.

Thanks,
Sebastian


Re: [PATCH][GRAPHITE] More TLC

2017-09-27 Thread Richard Biener
On Wed, 27 Sep 2017, Richard Biener wrote:

> On Wed, 27 Sep 2017, Richard Biener wrote:
> 
> > On Tue, 26 Sep 2017, Sebastian Pop wrote:
> > 
> > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener  wrote:
> > > 
> > > > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> > > >
> > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener 
> > > > wrote:
> > > > >
> > > > > >
> > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > > > > when reworking canonicalize_loop_closed_ssa.
> > > > > >
> > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to 
> > > > > > trunk.
> > > > > >
> > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > > > -Ofast -march=haswell -floop-nest-optimize are
> > > > > >
> > > > > >  61 loop nests "optimized"
> > > > > >  45 loop nest transforms cancelled because of code generation issues
> > > > > >  21 loop nest optimizations timed out at the 35 ISL "operations" we allow
> > > > > >
> > > > > > I say "optimized" because the usual transform I've seen is static
> > > > tiling
> > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > > > There's no way to automagically figure what kind of transform ISL 
> > > > > > did
> > > > > >
> > > > >
> > > > > Here is how to automate (without magic) the detection
> > > > > of the transform that isl did.
> > > > >
> > > > > The problem solved by isl is the minimization of strides
> > > > > in memory, and to do this, we need to tell the isl scheduler
> > > > > the validity dependence graph, in graphite-optimize-isl.c
> > > > > see the validity (RAW, WAR, WAW) and the proximity
> > > > > (RAR + validity) maps.  The proximity does include the
> > > > > read after read, as the isl scheduler needs to minimize
> > > > > strides between consecutive reads.
> > 
> > Ah, so I now see why we do not perform interchange on trivial cases like
> > 
> > double A[1024][1024], B[1024][1024];
> > 
> > void foo(void)
> > {
> >   for (int i = 0; i < 1024; ++i)
> > for (int j = 0; j < 1024; ++j)
> >   A[j][i] = B[j][i];
> > }
> > 
> > which is probably because
> > 
> >   /* FIXME: proximity should not be validity.  */
> >   isl_union_map *proximity = isl_union_map_copy (validity);
> > 
> > falls apart when there is _no_ dependence?
> > 
> > I can trick GRAPHITE into performing the interchange for
> > 
> > double A[1024][1024], B[1024][1024];
> > 
> > void foo(void)
> > {
> >   for (int i = 1; i < 1023; ++i)
> > for (int j = 0; j < 1024; ++j)
> >   A[j][i] = B[j][i-1] + A[j][i+1];
> > }
> > 
> > because now there is a dependence.  Any idea on how to rewrite
> > scop_get_dependences to avoid "simplifying"?  I suppose the
> > validity constraints _do_ also specify kind-of a proximity
> > we just may not prune / optimize them in the same way as
> > dependences?
> 
> Another thing I notice is that we don't handle the multi-dimensional
> accesses the fortran frontend produces:
> 
> (gdb) p debug_data_reference (dr)
> #(Data Ref: 
> #  bb: 18 
> #  stmt: _43 = *a_141(D)[_42];
> #  ref: *a_141(D)[_42];
> #  base_object: *a_141(D);
> #  Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, 
> stride.88_115}_5
> 
> ultimately we fail here because we try to build a constraint for
> 
> {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5
> 
> which ends up computing isl_pw_aff_mul (A, stride.88_115) with
> A being the non-constant constraint generated for
> {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being
> a parameter.  ISL doesn't like that multiplication as the result
> isn't affine (well - it is, we just have parameters in there).
> 
> I suppose ISL doesn't handle this form of accesses given the
> two "dimensions" in this scalarized form may overlap?  So we'd
> really need to turn those into references with different access
> functions (even if that's not 100% a valid semantic transformation
> as scalarization isn't reversible without extra information)?

Looks like even when hacking the Fortran FE to produce nested
ARRAY_REFs we run into the same issue for

(gdb) p debug_data_reference (dr)
#(Data Ref: 
#  bb: 17 
#  stmt: VIEW_CONVERT_EXPR(*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8} = 0.0;
#  ref: VIEW_CONVERT_EXPR(*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8};
#  base_object: VIEW_CONVERT_EXPR(*y_117(D));
#  Access function 0: {1, +, 1}_4
#  Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, 
(unsigned long) stride.88_92}_3;
#  Access function 2: (integer(kind=8)) {(unsigned 

Re: [PATCH][GRAPHITE] More TLC

2017-09-27 Thread Richard Biener
On Wed, 27 Sep 2017, Richard Biener wrote:

> On Tue, 26 Sep 2017, Sebastian Pop wrote:
> 
> > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener  wrote:
> > 
> > > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> > >
> > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener 
> > > wrote:
> > > >
> > > > >
> > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > > > when reworking canonicalize_loop_closed_ssa.
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > > > >
> > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > > -Ofast -march=haswell -floop-nest-optimize are
> > > > >
> > > > >  61 loop nests "optimized"
> > > > >  45 loop nest transforms cancelled because of code generation issues
> > > > >  21 loop nest optimizations timed out at the 35 ISL "operations" we allow
> > > > >
> > > > > I say "optimized" because the usual transform I've seen is static
> > > tiling
> > > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > > There's no way to automagically figure what kind of transform ISL did
> > > > >
> > > >
> > > > Here is how to automate (without magic) the detection
> > > > of the transform that isl did.
> > > >
> > > > The problem solved by isl is the minimization of strides
> > > > in memory, and to do this, we need to tell the isl scheduler
> > > > the validity dependence graph, in graphite-optimize-isl.c
> > > > see the validity (RAW, WAR, WAW) and the proximity
> > > > (RAR + validity) maps.  The proximity does include the
> > > > read after read, as the isl scheduler needs to minimize
> > > > strides between consecutive reads.
> 
> Ah, so I now see why we do not perform interchange on trivial cases like
> 
> double A[1024][1024], B[1024][1024];
> 
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
> for (int j = 0; j < 1024; ++j)
>   A[j][i] = B[j][i];
> }
> 
> which is probably because
> 
>   /* FIXME: proximity should not be validity.  */
>   isl_union_map *proximity = isl_union_map_copy (validity);
> 
> falls apart when there is _no_ dependence?
> 
> I can trick GRAPHITE into performing the interchange for
> 
> double A[1024][1024], B[1024][1024];
> 
> void foo(void)
> {
>   for (int i = 1; i < 1023; ++i)
> for (int j = 0; j < 1024; ++j)
>   A[j][i] = B[j][i-1] + A[j][i+1];
> }
> 
> because now there is a dependence.  Any idea on how to rewrite
> scop_get_dependences to avoid "simplifying"?  I suppose the
> validity constraints _do_ also specify kind-of a proximity
> we just may not prune / optimize them in the same way as
> dependences?

Another thing I notice is that we don't handle the multi-dimensional
accesses the fortran frontend produces:

(gdb) p debug_data_reference (dr)
#(Data Ref: 
#  bb: 18 
#  stmt: _43 = *a_141(D)[_42];
#  ref: *a_141(D)[_42];
#  base_object: *a_141(D);
#  Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5

ultimately we fail here because we try to build a constraint for

{{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5

which ends up computing isl_pw_aff_mul (A, stride.88_115) with
A being the non-constant constraint generated for
{(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being
a parameter.  ISL doesn't like that multiplication as the result
isn't affine (well - it is, we just have parameters in there).

I suppose ISL doesn't handle this form of accesses given the
two "dimensions" in this scalarized form may overlap?  So we'd
really need to turn those into references with different access
functions (even if that's not 100% a valid semantic transformation
as scalarization isn't reversible without extra information)?
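
A hypothetical sketch of the missing step: the scalarized access is an affine
function of a flat subscript idx = base + j * stride + i, and the pair (j, i)
can only be recovered under the extra assumption 0 <= i < stride, which is
exactly the information scalarization throws away (the names below are
illustrative, not the GCC internals):

```c
#include <assert.h>

/* Recover the two "dimensions" (j, i) from the flat subscript
   idx = base + j * stride + i, assuming 0 <= i < stride.
   Without that range assumption the dimensions may overlap and
   the recovery is not a valid transformation.  */
static void
delinearize (long idx, long base, long stride, long *j, long *i)
{
  long off = idx - base;
  *j = off / stride;
  *i = off % stride;
}
```

If the range assumption could be asserted (or versioned on) at runtime, the
two recovered subscripts could be handed to ISL as separate affine access
functions instead of the non-affine product above.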

Thanks,
Richard.


Re: [PATCH][GRAPHITE] More TLC

2017-09-27 Thread Richard Biener
On Tue, 26 Sep 2017, Sebastian Pop wrote:

> On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener  wrote:
> 
> > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> >
> > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener 
> > wrote:
> > >
> > > >
> > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > > when reworking canonicalize_loop_closed_ssa.
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > > >
> > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > -Ofast -march=haswell -floop-nest-optimize are
> > > >
> > > >  61 loop nests "optimized"
> > > >  45 loop nest transforms cancelled because of code generation issues
> > > >  21 loop nest optimizations timed out the 35 ISL "operations" we
> > allow
> > > >
> > > > I say "optimized" because the usual transform I've seen is static
> > tiling
> > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > There's no way to automagically figure what kind of transform ISL did
> > > >
> > >
> > > Here is how to automate (without magic) the detection
> > > of the transform that isl did.
> > >
> > > The problem solved by isl is the minimization of strides
> > > in memory, and to do this, we need to tell the isl scheduler
> > > the validity dependence graph, in graphite-optimize-isl.c
> > > see the validity (RAW, WAR, WAW) and the proximity
> > > (RAR + validity) maps.  The proximity does include the
> > > read after read, as the isl scheduler needs to minimize
> > > strides between consecutive reads.

Ah, so I now see why we do not perform interchange on trivial cases like

double A[1024][1024], B[1024][1024];

void foo(void)
{
  for (int i = 0; i < 1024; ++i)
for (int j = 0; j < 1024; ++j)
  A[j][i] = B[j][i];
}

which is probably because

  /* FIXME: proximity should not be validity.  */
  isl_union_map *proximity = isl_union_map_copy (validity);

falls apart when there is _no_ dependence?
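For comparison, a minimal sketch (assuming row-major C layout) of the
interchanged nest one would want the scheduler to produce; the innermost loop
then walks consecutive elements of each row with unit stride:

```c
#include <assert.h>

#define N 1024
static double A[N][N], B[N][N];

/* Original order: the inner j loop varies the row index of A[j][i],
   so consecutive iterations touch addresses N * sizeof (double) apart.  */
static void
foo (void)
{
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      A[j][i] = B[j][i];
}

/* Interchanged order: the inner i loop varies the column index,
   so consecutive iterations touch adjacent addresses (unit stride).  */
static void
foo_interchanged (void)
{
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      A[j][i] = B[j][i];
}
```

Both versions compute the same result; only the order of the memory accesses
differs, which is precisely what proximity constraints are meant to optimize,
so a proximity map that collapses to nothing when there are no dependences
cannot drive this transform.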

I can trick GRAPHITE into performing the interchange for

double A[1024][1024], B[1024][1024];

void foo(void)
{
  for (int i = 1; i < 1023; ++i)
for (int j = 0; j < 1024; ++j)
  A[j][i] = B[j][i-1] + A[j][i+1];
}

because now there is a dependence.  Any idea on how to rewrite
scop_get_dependences to avoid "simplifying"?  I suppose the
validity constraints _do_ also specify kind-of a proximity
we just may not prune / optimize them in the same way as
dependences?

Richard.



Re: [PATCH][GRAPHITE] More TLC

2017-09-26 Thread Sven Verdoolaege
On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote:
> Sven, is there already a function that computes the sum of all
> strides in a proximity map?  Maybe you have code that does
> something similar in pet or ppcg?

What exactly do you want to sum?
If this involves any counting, then it cannot currently
be done in pet or ppcg since isl does not support counting yet
and the public version of barvinok is GPL licensed.

Also, it's better to ask such questions on the isl mailing list
isl-developm...@googlegroups.com

skimo


Re: [PATCH][GRAPHITE] More TLC

2017-09-26 Thread Sebastian Pop
On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener  wrote:

> On Fri, 22 Sep 2017, Sebastian Pop wrote:
>
> > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener 
> wrote:
> >
> > >
> > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > when reworking canonicalize_loop_closed_ssa.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > >
> > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > -Ofast -march=haswell -floop-nest-optimize are
> > >
> > >  61 loop nests "optimized"
> > >  45 loop nest transforms cancelled because of code generation issues
> > >  21 loop nest optimizations timed out the 35 ISL "operations" we
> allow
> > >
> > > I say "optimized" because the usual transform I've seen is static
> tiling
> > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > There's no way to automagically figure what kind of transform ISL did
> > >
> >
> > Here is how to automate (without magic) the detection
> > of the transform that isl did.
> >
> > The problem solved by isl is the minimization of strides
> > in memory, and to do this, we need to tell the isl scheduler
> > the validity dependence graph, in graphite-optimize-isl.c
> > see the validity (RAW, WAR, WAW) and the proximity
> > (RAR + validity) maps.  The proximity does include the
> > read after read, as the isl scheduler needs to minimize
> > strides between consecutive reads.
> >
> > When you apply the schedule to the dependence graph,
> > one can tell from the result the strides in memory, a good
> > way to say whether a transform was beneficial is to sum up
> > all memory strides, and make sure that the sum of all strides
> > decreases after transform.  We could add a printf with the
> > sum of strides before and after transforms, and have the
> > testcases check for that.
>
> Interesting.  Can you perhaps show me in code how to do that?
>
>
Sven, is there already a function that computes the sum of all
strides in a proximity map?  Maybe you have code that does
something similar in pet or ppcg?

Thanks,
Sebastian


Re: [PATCH][GRAPHITE] More TLC

2017-09-25 Thread Richard Biener
On Fri, 22 Sep 2017, Sebastian Pop wrote:

> On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener  wrote:
> 
> >
> > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > when reworking canonicalize_loop_closed_ssa.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> >
> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > -Ofast -march=haswell -floop-nest-optimize are
> >
> >  61 loop nests "optimized"
> >  45 loop nest transforms cancelled because of code generation issues
> >  21 loop nest optimizations timed out the 35 ISL "operations" we allow
> >
> > I say "optimized" because the usual transform I've seen is static tiling
> > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > There's no way to automagically figure what kind of transform ISL did
> >
> 
> Here is how to automate (without magic) the detection
> of the transform that isl did.
> 
> The problem solved by isl is the minimization of strides
> in memory, and to do this, we need to tell the isl scheduler
> the validity dependence graph, in graphite-optimize-isl.c
> see the validity (RAW, WAR, WAW) and the proximity
> (RAR + validity) maps.  The proximity does include the
> read after read, as the isl scheduler needs to minimize
> strides between consecutive reads.
> 
> When you apply the schedule to the dependence graph,
> one can tell from the result the strides in memory, a good
> way to say whether a transform was beneficial is to sum up
> all memory strides, and make sure that the sum of all strides
> decreases after transform.  We could add a printf with the
> sum of strides before and after transforms, and have the
> testcases check for that.

Interesting.  Can you perhaps show me in code how to do that?

Thanks,
Richard.


Re: [PATCH][GRAPHITE] More TLC

2017-09-25 Thread Richard Biener
On Mon, 25 Sep 2017, Bin.Cheng wrote:

> On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener  wrote:
> > On Mon, 25 Sep 2017, Richard Biener wrote:
> >
> >> On Fri, 22 Sep 2017, Richard Biener wrote:
> >>
> >> >
> >> > This simplifies canonicalize_loop_closed_ssa and does other minimal
> >> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> >> > when reworking canonicalize_loop_closed_ssa.
> >> >
> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> >> >
> >> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> >> > -Ofast -march=haswell -floop-nest-optimize are
> >> >
> >> >  61 loop nests "optimized"
> >> >  45 loop nest transforms cancelled because of code generation issues
> >> >  21 loop nest optimizations timed out the 35 ISL "operations" we 
> >> > allow
> >>
> >> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
> >> and 709 sec. with (this was with release checking).
> >>
> >> A single-run has 416.gamess (580s -> 618s),
> >> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
> >> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s ->
> >> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
> >> do a 3-run for those to confirm (it would be only a single regression
> >> for 416.gamess).
> >
> > 416.gamess regression confirmed, 450.soplex improvement as well,
> 436/437 improvements?  450.soplex (229s -> 226s) looks like noise.

base is with -floop-nest-optimize, peak without.

416.gamess       19580   619   31.7 S      19580   576   34.0 *
416.gamess       19580   614   31.9 S      19580   577   33.9 S
416.gamess       19580   618   31.7 *      19580   576   34.0 S
436.cactusADM    11950   194   61.5 S      11950   204   58.5 S
436.cactusADM    11950   184   65.0 S      11950   187   63.8 *
436.cactusADM    11950   186   64.1 *      11950   186   64.1 S
437.leslie3d      9400   219   43.0 S       9400   218   43.1 S
437.leslie3d      9400   219   43.0 *       9400   223   42.1 S
437.leslie3d      9400   218   43.0 S       9400   223   42.2 *
450.soplex        8340   225   37.0 S       8340   231   36.1 S
450.soplex        8340   226   36.9 *       8340   230   36.3 *
450.soplex        8340   227   36.8 S       8340   229   36.4 S
465.tonto         9840   426   23.1 S       9840   427   23.0 *
465.tonto         9840   424   23.2 S       9840   430   22.9 S
465.tonto         9840   425   23.2 *       9840   425   23.2 S
401.bzip2         9650   379   25.5 S       9650   378   25.5 S
401.bzip2         9650   379   25.5 *       9650   380   25.4 *
401.bzip2         9650   379   25.5 S       9650   380   25.4 S
462.libquantum   20720   351   59.0 *      20720   349   59.4 S
462.libquantum   20720   351   59.0 S      20720   345   60.1 *
462.libquantum   20720   352   58.8 S      20720   344   60.2 S



> Thanks,
> bin
> > in the three-run 462.libquantum regresses (344s -> 351s) so I suppose
> > that's noise.
> >
> > Richard.
> 
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


Re: [PATCH][GRAPHITE] More TLC

2017-09-25 Thread Bin.Cheng
On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener  wrote:
> On Mon, 25 Sep 2017, Richard Biener wrote:
>
>> On Fri, 22 Sep 2017, Richard Biener wrote:
>>
>> >
>> > This simplifies canonicalize_loop_closed_ssa and does other minimal
>> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
>> > when reworking canonicalize_loop_closed_ssa.
>> >
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>> >
>> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
>> > -Ofast -march=haswell -floop-nest-optimize are
>> >
>> >  61 loop nests "optimized"
>> >  45 loop nest transforms cancelled because of code generation issues
>> >  21 loop nest optimizations timed out the 35 ISL "operations" we allow
>>
>> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
>> and 709 sec. with (this was with release checking).
>>
>> A single-run has 416.gamess (580s -> 618s),
>> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
>> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s ->
>> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
>> do a 3-run for those to confirm (it would be only a single regression
>> for 416.gamess).
>
> 416.gamess regression confirmed, 450.soplex improvement as well,
436/437 improvements?  450.soplex (229s -> 226s) looks like noise.

Thanks,
bin
> in the three-run 462.libquantum regresses (344s -> 351s) so I suppose
> that's noise.
>
> Richard.


Re: [PATCH][GRAPHITE] More TLC

2017-09-25 Thread Richard Biener
On Mon, 25 Sep 2017, Richard Biener wrote:

> On Fri, 22 Sep 2017, Richard Biener wrote:
> 
> > 
> > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > when reworking canonicalize_loop_closed_ssa.
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > 
> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > -Ofast -march=haswell -floop-nest-optimize are
> > 
> >  61 loop nests "optimized"
> >  45 loop nest transforms cancelled because of code generation issues
> >  21 loop nest optimizations timed out the 35 ISL "operations" we allow
> 
> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
> and 709 sec. with (this was with release checking).
> 
> A single-run has 416.gamess (580s -> 618s),
> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> 
> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
> do a 3-run for those to confirm (it would be only a single regression
> for 416.gamess).

416.gamess regression confirmed, 450.soplex improvement as well,
in the three-run 462.libquantum regresses (344s -> 351s) so I suppose
that's noise.

Richard.


Re: [PATCH][GRAPHITE] More TLC

2017-09-25 Thread Richard Biener
On Fri, 22 Sep 2017, Richard Biener wrote:

> 
> This simplifies canonicalize_loop_closed_ssa and does other minimal
> TLC.  It also adds a testcase I reduced from a stupid mistake I made
> when reworking canonicalize_loop_closed_ssa.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> 
> SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> -Ofast -march=haswell -floop-nest-optimize are
> 
>  61 loop nests "optimized"
>  45 loop nest transforms cancelled because of code generation issues
>  21 loop nest optimizations timed out the 35 ISL "operations" we allow

Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
and 709 sec. with (this was with release checking).

A single-run has 416.gamess (580s -> 618s),
436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> 
379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
do a 3-run for those to confirm (it would be only a single regression
for 416.gamess).

So far I'm positively surprised given the limitations (and inefficiencies)
I know.

I'll add some more opt-info stuff to assess the number of SCOPs we
detect but discard during further analysis and the number of transforms
we cancel because they turn out as a no-op.

Richard.

>  {
> -  if (dump_file && dump_flags)
> - fprintf (dump_file, "isl timed out --param max-isl-operations=%d\n",
> -  max_operations);
> +  location_t loc = find_loop_location
> + (scop->scop_info->region.entry->dest->loop_father);
> +  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loc,
> +"loop nest not optimized, optimization timed out "
> +"after %d operations [--param max-isl-operations]\n",
> +max_operations);
>return false;
>  }
>  
> Index: gcc/graphite.c
> ===================================================================
> --- gcc/graphite.c	(revision 253091)
> +++ gcc/graphite.c	(working copy)
> @@ -293,86 +293,6 @@ free_scops (vec<scop_p> scops)
>scops.release ();
>  }
>  
> -/* Returns true when P1 and P2 are close phis with the same
> -   argument.  */
> -
> -static inline bool
> -same_close_phi_node (gphi *p1, gphi *p2)
> -{
> -  return (types_compatible_p (TREE_TYPE (gimple_phi_result (p1)),
> -   TREE_TYPE (gimple_phi_result (p2)))
> -   && operand_equal_p (gimple_phi_arg_def (p1, 0),
> -   gimple_phi_arg_def (p2, 0), 0));
> -}
> -
> -static void make_close_phi_nodes_unique (basic_block bb);
> -
> -/* Remove the close phi node at GSI and replace its rhs with the rhs
> -   of PHI.  */
> -
> -static void
> -remove_duplicate_close_phi (gphi *phi, gphi_iterator *gsi)
> -{
> -  gimple *use_stmt;
> -  use_operand_p use_p;
> -  imm_use_iterator imm_iter;
> -  tree res = gimple_phi_result (phi);
> -  tree def = gimple_phi_result (gsi->phi ());
> -
> -  gcc_assert (same_close_phi_node (phi, gsi->phi ()));
> -
> -  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def)
> -{
> -  FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
> - SET_USE (use_p, res);
> -
> -  update_stmt (use_stmt);
> -
> -  /* It is possible that we just created a duplicate close-phi
> -  for an already-processed containing loop.  Check for this
> -  case and clean it up.  */
> -  if (gimple_code (use_stmt) == GIMPLE_PHI
> -   && gimple_phi_num_args (use_stmt) == 1)
> - make_close_phi_nodes_unique (gimple_bb (use_stmt));
> -}
> -
> -  remove_phi_node (gsi, true);
> -}
> -
> -/* Removes all the close phi duplicates from BB.  */
> -
> -static void
> -make_close_phi_nodes_unique (basic_block bb)
> -{
> -  gphi_iterator psi;
> -
> -  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
> -{
> -  gphi_iterator gsi = psi;
> -  gphi *phi = psi.phi ();
> -
> -  /* At this point, PHI should be a close phi in normal form.  */
> -  gcc_assert (gimple_phi_num_args (phi) == 1);
> -
> -  /* Iterate over the next phis and remove duplicates.  */
> -  gsi_next (&gsi);
> -  while (!gsi_end_p (gsi))
> - if (same_close_phi_node (phi, gsi.phi ()))
> -   remove_duplicate_close_phi (phi, &gsi);
> - else
> -   gsi_next (&gsi);
> -}
> -}
> -
> -/* Return true when NAME is defined in LOOP.  */
> -
> -static bool
> -defined_in_loop_p (tree name, loop_p loop)
> -{
> -  gcc_assert (TREE_CODE (name) == SSA_NAME);
> -  return loop == loop_containing_stmt (SSA_NAME_DEF_STMT (name));
> -}
> -
>  /* Transforms LOOP to the canonical loop closed SSA form.  */
>  
>  static void
> @@ -380,20 +300,22 @@ canonicalize_loop_closed_ssa (loop_p loo
>  {
>edge e = single_exit (loop);
>basic_block bb;
> +  gphi_iterator psi;
>  
>if (!e || (e->flags & EDGE_COMPLEX))
>  return;
>  
>bb = e->dest;
>  
> +  /* Make the loop-close PHI node BB contain only PHIs and have a
> + single predecessor.  */
>

Re: [PATCH][GRAPHITE] More TLC

2017-09-22 Thread Sebastian Pop
On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener  wrote:

>
> This simplifies canonicalize_loop_closed_ssa and does other minimal
> TLC.  It also adds a testcase I reduced from a stupid mistake I made
> when reworking canonicalize_loop_closed_ssa.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>
> SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> -Ofast -march=haswell -floop-nest-optimize are
>
>  61 loop nests "optimized"
>  45 loop nest transforms cancelled because of code generation issues
>  21 loop nest optimizations timed out the 35 ISL "operations" we allow
>
> I say "optimized" because the usual transform I've seen is static tiling
> as enforced by GRAPHITE according to --param loop-block-tile-size.
> There's no way to automagically figure what kind of transform ISL did
>

Here is how to automate (without magic) the detection
of the transform that isl did.

The problem solved by isl is the minimization of strides
in memory, and to do this, we need to tell the isl scheduler
the validity dependence graph, in graphite-optimize-isl.c
see the validity (RAW, WAR, WAW) and the proximity
(RAR + validity) maps.  The proximity does include the
read after read, as the isl scheduler needs to minimize
strides between consecutive reads.

When you apply the schedule to the dependence graph,
one can tell from the result the strides in memory, a good
way to say whether a transform was beneficial is to sum up
all memory strides, and make sure that the sum of all strides
decreases after transform.  We could add a printf with the
sum of strides before and after transforms, and have the
testcases check for that.
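As a rough, self-contained illustration of that metric (not the isl
formulation, which works on the dependence relations symbolically), one can
sum the address distances between consecutive iterations of each candidate
schedule and check that the transform lowers the total:

```c
#include <assert.h>
#include <stdlib.h>

#define N 64

/* Sum of byte distances between the addresses touched by A[j][i]
   on consecutive iterations, for a given loop order.  A smaller
   total indicates better spatial locality.  */
static long
stride_sum (int i_innermost)
{
  long sum = 0, prev = -1;
  for (int a = 0; a < N; ++a)
    for (int b = 0; b < N; ++b)
      {
        /* Map loop order to subscripts: either i or j is innermost.  */
        int i = i_innermost ? b : a;
        int j = i_innermost ? a : b;
        long addr = ((long) j * N + i) * sizeof (double);
        if (prev >= 0)
          sum += labs (addr - prev);
        prev = addr;
      }
  return sum;
}
```

With i innermost every step advances by sizeof (double), so the sum is
(N*N - 1) * sizeof (double); with j innermost each step jumps a full row, so
the sum is far larger.  Printing these two totals before and after the
transform is the kind of check the testcases could grep for.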

> (usually none with the schedule identical check confused by FILTER
> stuff positioning).  This is also the issue with most GRAPHITE testcases.
> We can't really verify easily whether we performed loop interchange
> or not.  We can probably tell whether we applied loop fusion or
> splitting (by counting loops).
>
> I'm not aware of any remaining ICEs / wrong-code issues with GRAPHITE.
>
> I'm aware that the current "black-box" granularity hinders
> scheduling freedom (each GIMPLE BB is mapped to a ISL stmt, this
> is too coarse to schedule say two writes in a BB independently
> from each other).  Quick experiments could be done by simply
> splitting gimple BBs at some points.
>
> I'm aware that the SCOP detection algorithm assumes that it can
> walk loop->next and find loops "in order" -- but while that's
> true for the initial flow_loops_find result (DFS walk) it isn't
> true for any later created / discovered loops.  Sorting of
> loop siblings in DFS order should be easy (and a general cfgloopanal
> helper).
>
> Richard.
>
> 2017-09-22  Richard Biener  
>
> * graphite-isl-ast-to-gimple.c (graphite_verify): Inline into
> single caller.
> (graphite_regenerate_ast_isl): Do not reset SCEV.  Move debug
> print of no dependency loops ...
> * graphite.c (graphite_transform_loops): ... here.
> (canonicalize_loop_closed_ssa_form): Work from inner to outer
> loops.
> (same_close_phi_node, remove_duplicate_close_phi,
> make_close_phi_nodes_unique, defined_in_loop_p): Fold into ...
> (canonicalize_loop_closed_ssa): ... here and simplify.
> * graphite-optimize-isl.c: Include tree-vectorizer.h.
> (optimize_isl): Use dump_printf_loc to tell when we stopped
> optimizing because of an ISL timeout.
>
> * gcc.dg/graphite/scop-24.c: New testcase.
>
>
The change looks good to me.

Thanks,
Sebastian