Re: Add a loop versioning pass

2018-12-15 Thread Richard Biener
On December 14, 2018 5:18:38 PM GMT+01:00, Richard Sandiford 
 wrote:
>Richard Biener  writes:
>> On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford
> wrote:
>>>Richard Biener  writes:
 On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
> Also repeated the performance testing (but haven't yet tried an
> LTO variant; will do that over the weekend).

 Any results?
>>>
>>>Sorry, I should've remembered that finding time to run tests is easy,
>>>finding time to analyse them is hard.
>>>
>>>Speed-wise, the impact of the patch for LTO is similar to without,
>>>with 554.roms_r being the main beneficiary for both AArch64 and
>x86_64.
>>>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>>>-flto=jobserver.
>>>
>>>Size-wise, there are three tests that grow by >=2% on x86_64:
>>>
>>>549.fotonik3d_r: 5.5%
>>>548.exchange2_r: 29.5%
>>>554.roms_r: 39.6%
>>
>> Uh. With LTO we might have a reasonable guessed profile and you do
>have an optimize_loop_nest_for_speed guard on the transform?
>
>Guard now added :-)  But unfortunately it doesn't make any significant
>difference.  548.exchange2_r goes from 29.5% to 27.7%, but the other
>two are the same as before.
>
>> How does compile time fare with the above benchmarks?
>
>For 554.roms_r it's +80%(!) with -flto=1, but I think that's par for
>the course given the increase in function sizes.

:(

>For 549.fotonik3d_r it's +5% with -flto=1.
>
>For 503.bwaves_r (as an example of a benchmark whose size doesn't
>change),
>the difference is in the noise.
>
>[...]
>
>>>You mean something like:
>>>
>>>  real :: foo(:,:), bar(:)
>>>
>>>  do i=1,n
>>>do j=1,n
>>>  foo(i,j) = ...
>>>end do
>>>bar(i) = ..
>>>  end do
>>>
>>>?  I can add a test if so.
>>
>> Please. 
>
>OK, I've added them to loop_versioning_4.f90.
>
 There may also be some subtle issues with substitute_and_fold being
 applied to non-up-to-date SSA form given it folds stmts looking at
 (single-use!) SSA edges.  The single-use guard might be what saves
>>>you
 here (SSA uses in the copies are not yet updated to point to the
 copied DEFs).
>>>
>>>OK.  I was hoping that because we only apply substitute_and_fold
>>>to new code, there would be no problem with uses elsewhere.
>>
>> Might be, yes.
>>
>>>Would it be safer to:
>>>
>>>  - version all loops we want to version
>>>  - update SSA explicitly
>>>  - apply substitute and fold to all "new" loops
>>
>> That would be definitely less fishy. But you need to get at the
>actual
>> 'new' SSA names for the replacements as I guess they'll be rewritten?
>> Or maybe those are not.
>>
>>>?  Could we then get away with returning a 0 TODO at the end?
>>
>> Yes. 
>
>OK, the updated patch does it this way.
>
>Tested as before.

OK. 

Thanks, 
Richard. 

>Thanks,
>Richard
>
>
>2018-12-14  Richard Sandiford  
>   Ramana Radhakrishnan  
>   Kyrylo Tkachov  
>
>gcc/
>   * doc/invoke.texi (-fversion-loops-for-strides): Document
>   (loop-versioning-group-size, loop-versioning-max-inner-insns)
>   (loop-versioning-max-outer-insns): Document new --params.
>   * Makefile.in (OBJS): Add gimple-loop-versioning.o.
>   * common.opt (fversion-loops-for-strides): New option.
>   * opts.c (default_options_table): Enable fversion-loops-for-strides
>   at -O3.
>   * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
>   (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
>   (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
>   * passes.def: Add pass_loop_versioning.
>   * timevar.def (TV_LOOP_VERSIONING): New time variable.
>   * tree-ssa-propagate.h
>   (substitute_and_fold_engine::substitute_and_fold): Add an optional
>   block parameter.
>   * tree-ssa-propagate.c
>   (substitute_and_fold_engine::substitute_and_fold): Likewise.
>   When passed, only walk blocks dominated by that block.
>   * tree-vrp.h (range_includes_p): Declare.
>   (range_includes_zero_p): Turn into an inline wrapper around
>   range_includes_p.
>   * tree-vrp.c (range_includes_p): New function, generalizing...
>   (range_includes_zero_p): ...this.
>   * tree-pass.h (make_pass_loop_versioning): Declare.
>   * gimple-loop-versioning.cc: New file.
>
>gcc/testsuite/
>   * gcc.dg/loop-versioning-1.c: New test.
>   * gcc.dg/loop-versioning-10.c: Likewise.
>   * gcc.dg/loop-versioning-11.c: Likewise.
>   * gcc.dg/loop-versioning-2.c: Likewise.
>   * gcc.dg/loop-versioning-3.c: Likewise.
>   * gcc.dg/loop-versioning-4.c: Likewise.
>   * gcc.dg/loop-versioning-5.c: Likewise.
>   * gcc.dg/loop-versioning-6.c: Likewise.
>   * gcc.dg/loop-versioning-7.c: Likewise.
>   * gcc.dg/loop-versioning-8.c: Likewise.
>   * gcc.dg/loop-versioning-9.c: Likewise.
>   * gfortran.dg/loop_versioning_1.f90: Likewise.
>   * gfortran.dg/loop_versioning_2.f90: 

Re: Add a loop versioning pass

2018-12-14 Thread Richard Sandiford
Richard Biener  writes:
> On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford 
>  wrote:
>>Richard Biener  writes:
>>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
 Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
 Also repeated the performance testing (but haven't yet tried an
 LTO variant; will do that over the weekend).
>>>
>>> Any results?
>>
>>Sorry, I should've remembered that finding time to run tests is easy,
>>finding time to analyse them is hard.
>>
>>Speed-wise, the impact of the patch for LTO is similar to without,
>>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>>-flto=jobserver.
>>
>>Size-wise, there are three tests that grow by >=2% on x86_64:
>>
>>549.fotonik3d_r: 5.5%
>>548.exchange2_r: 29.5%
>>554.roms_r: 39.6%
>
> Uh. With LTO we might have a reasonable guessed profile and you do have an
> optimize_loop_nest_for_speed guard on the transform?

Guard now added :-)  But unfortunately it doesn't make any significant
difference.  548.exchange2_r goes from 29.5% to 27.7%, but the other
two are the same as before.

> How does compile time fare with the above benchmarks?

For 554.roms_r it's +80%(!) with -flto=1, but I think that's par for
the course given the increase in function sizes.

For 549.fotonik3d_r it's +5% with -flto=1.

For 503.bwaves_r (as an example of a benchmark whose size doesn't change),
the difference is in the noise.

[...]

>>You mean something like:
>>
>>  real :: foo(:,:), bar(:)
>>
>>  do i=1,n
>>do j=1,n
>>  foo(i,j) = ...
>>end do
>>bar(i) = ..
>>  end do
>>
>>?  I can add a test if so.
>
> Please. 

OK, I've added them to loop_versioning_4.f90.

>>> There may also be some subtle issues with substitute_and_fold being
>>> applied to non-up-to-date SSA form given it folds stmts looking at
>>> (single-use!) SSA edges.  The single-use guard might be what saves
>>you
>>> here (SSA uses in the copies are not yet updated to point to the
>>> copied DEFs).
>>
>>OK.  I was hoping that because we only apply substitute_and_fold
>>to new code, there would be no problem with uses elsewhere.
>
> Might be, yes.
>
>>Would it be safer to:
>>
>>  - version all loops we want to version
>>  - update SSA explicitly
>>  - apply substitute and fold to all "new" loops
>
> That would be definitely less fishy. But you need to get at the actual
> 'new' SSA names for the replacements as I guess they'll be rewritten?
> Or maybe those are not.
>
>>?  Could we then get away with returning a 0 TODO at the end?
>
> Yes. 

OK, the updated patch does it this way.

Tested as before.

Thanks,
Richard


2018-12-14  Richard Sandiford  
Ramana Radhakrishnan  
Kyrylo Tkachov  

gcc/
* doc/invoke.texi (-fversion-loops-for-strides): Document
(loop-versioning-group-size, loop-versioning-max-inner-insns)
(loop-versioning-max-outer-insns): Document new --params.
* Makefile.in (OBJS): Add gimple-loop-versioning.o.
* common.opt (fversion-loops-for-strides): New option.
* opts.c (default_options_table): Enable fversion-loops-for-strides
at -O3.
* params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
(PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
(PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
* passes.def: Add pass_loop_versioning.
* timevar.def (TV_LOOP_VERSIONING): New time variable.
* tree-ssa-propagate.h
(substitute_and_fold_engine::substitute_and_fold): Add an optional
block parameter.
* tree-ssa-propagate.c
(substitute_and_fold_engine::substitute_and_fold): Likewise.
When passed, only walk blocks dominated by that block.
* tree-vrp.h (range_includes_p): Declare.
(range_includes_zero_p): Turn into an inline wrapper around
range_includes_p.
* tree-vrp.c (range_includes_p): New function, generalizing...
(range_includes_zero_p): ...this.
* tree-pass.h (make_pass_loop_versioning): Declare.
* gimple-loop-versioning.cc: New file.

gcc/testsuite/
* gcc.dg/loop-versioning-1.c: New test.
* gcc.dg/loop-versioning-10.c: Likewise.
* gcc.dg/loop-versioning-11.c: Likewise.
* gcc.dg/loop-versioning-2.c: Likewise.
* gcc.dg/loop-versioning-3.c: Likewise.
* gcc.dg/loop-versioning-4.c: Likewise.
* gcc.dg/loop-versioning-5.c: Likewise.
* gcc.dg/loop-versioning-6.c: Likewise.
* gcc.dg/loop-versioning-7.c: Likewise.
* gcc.dg/loop-versioning-8.c: Likewise.
* gcc.dg/loop-versioning-9.c: Likewise.
* gfortran.dg/loop_versioning_1.f90: Likewise.
* gfortran.dg/loop_versioning_2.f90: Likewise.
* gfortran.dg/loop_versioning_3.f90: Likewise.
* gfortran.dg/loop_versioning_4.f90: Likewise.
* gfortran.dg/loop_versioning_5.f90: Likewise.
* 

Re: Add a loop versioning pass

2018-12-13 Thread Richard Biener
On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford 
 wrote:
>Richard Biener  writes:
>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>> Also repeated the performance testing (but haven't yet tried an
>>> LTO variant; will do that over the weekend).
>>
>> Any results?
>
>Sorry, I should've remembered that finding time to run tests is easy,
>finding time to analyse them is hard.
>
>Speed-wise, the impact of the patch for LTO is similar to without,
>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>-flto=jobserver.
>
>Size-wise, there are three tests that grow by >=2% on x86_64:
>
>549.fotonik3d_r: 5.5%
>548.exchange2_r: 29.5%
>554.roms_r: 39.6%

Uh. With LTO we might have a reasonable guessed profile and you do have an
optimize_loop_nest_for_speed guard on the transform?

How does compile time fare with the above benchmarks?

>The roms increase seems fair enough given the benefit, since the
>whole program uses a similar style for handling arrays.
>
>fotonik3d is a mixed bag.  Some versioning conditions are from
>things like:
>
>  subroutine foo(a)
>real :: a(:,:,:)
>a(1,:,:) = ...
>  end subroutine
>
>where we version for the middle dimension having a stride of 1.
>This could be eliminated if we had more semantic information.
>
>Other conditions are for things like:
>
>  subroutine foo(a)
>real :: a(:,:,:)
>a(:,1,:) = ...
>  end subroutine
>
>where the pass really does help, but unfortunately those loops
>aren't hot and might not even be used at all.
>
>Some opportunities are cases in which we avoid gather loads, such as
>in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
>But mp feeds a gather load itself and these assignments aren't a
>critical part of the loop.
>
>(I'm testing on an AVX-only system, so these are synthesised gathers
>rather than real gathers.)
>
>The growth in 548.exchange2_r comes from reasonable changes to cold
>code.
>The test spends almost all its time in __brute_force_MOD_digits_2,
>which
>contains the same code before and after the patch, but which accounts
>for less than 1% of .text before the patch.
>
>> I've skimmed over the updated patch and it looks
>> a lot better now.
>>
>> +bool
>> +loop_versioning
>> +::find_per_loop_multiplication (address_info &, address_term_info &)
>> +{
>>
>> is that what coding convention allows?
>
>Not sure we've settled on something, so I improvised.
>
>> For grepping I'd then say we should do
>>
>> bool loop_versioning::
>> find_per_loop_multiplication (...)
>>
>> ;)
>
>Sounds good to me.
>
>> Anywhere else you use
>>
>> loop_versioning::foo
>>
>> so please stick to that.
>
>Yeah, I used that when I could avoid an argument spilling over 80
>chars,
>but I think I missed that the above function now fits into that
>category.
>Will double-check the others.
>
>> Otherwise OK.
>
>Thanks :-)
>
>> I think I don't see a testcase where we could version both loops in a
>nest
>> so I'm not sure whether the transform works fine when you are only
>> updating SSA form at the very end of the pass.
>
>You mean something like:
>
>  real :: foo(:,:), bar(:)
>
>  do i=1,n
>do j=1,n
>  foo(i,j) = ...
>end do
>bar(i) = ..
>  end do
>
>?  I can add a test if so.

Please. 

>At the moment the pass only looks for direct versioning opportunities
>in inner loops, so the assignment to bar wouldn't be treated as a
>versioning opportunity.  

Ah, OK. 

>We should still hoist the version check
>for foo outside both loops, which is tested by loop_versioning_4.f90
>for cases in which the outer loop doesn't have its own array
>arithmetic, but isn't tested for cases like the above.
>
>> There may also be some subtle issues with substitute_and_fold being
>> applied to non-up-to-date SSA form given it folds stmts looking at
>> (single-use!) SSA edges.  The single-use guard might be what saves
>you
>> here (SSA uses in the copies are not yet updated to point to the
>> copied DEFs).
>
>OK.  I was hoping that because we only apply substitute_and_fold
>to new code, there would be no problem with uses elsewhere.

Might be, yes.

>Would it be safer to:
>
>  - version all loops we want to version
>  - update SSA explicitly
>  - apply substitute and fold to all "new" loops

That would be definitely less fishy. But you need to get at the actual 'new' 
SSA names for the replacements as I guess they'll be rewritten? Or maybe those 
are not. 

>?  Could we then get away with returning a 0 TODO at the end?

Yes. 

Richard. 

>Thanks,
>Richard



Re: Add a loop versioning pass

2018-12-12 Thread Richard Sandiford
Richard Biener  writes:
> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>> Also repeated the performance testing (but haven't yet tried an
>> LTO variant; will do that over the weekend).
>
> Any results?

Sorry, I should've remembered that finding time to run tests is easy,
finding time to analyse them is hard.

Speed-wise, the impact of the patch for LTO is similar to without,
with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
-flto=jobserver.

Size-wise, there are three tests that grow by >=2% on x86_64:

549.fotonik3d_r: 5.5%
548.exchange2_r: 29.5%
554.roms_r: 39.6%

The roms increase seems fair enough given the benefit, since the
whole program uses a similar style for handling arrays.

fotonik3d is a mixed bag.  Some versioning conditions are from
things like:

  subroutine foo(a)
real :: a(:,:,:)
a(1,:,:) = ...
  end subroutine

where we version for the middle dimension having a stride of 1.
This could be eliminated if we had more semantic information.
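To make the a(1,:,:) case concrete, here is a hedged C flattening of that kind of assignment (the names, strides, and layout are illustrative assumptions, not the pass's actual output): element (1,j,k) of an assumed-shape array sits at base + (j-1)*s2 + (k-1)*s3, so the inner loop steps by the middle-dimension stride s2, and the pass versions for s2 == 1 even though that only holds when the first dimension has extent 1.

```c
#include <assert.h>

/* Hypothetical sketch of a(1,:,:) = 1.0 after versioning for s2 == 1.
   s2 and s3 are the element strides of the second and third dimensions.  */
static void assign_middle(float *a, int n2, int n3, long s2, long s3)
{
  if (s2 == 1)
    {
      /* Versioned fast path: the inner loop is contiguous.  */
      for (int k = 0; k < n3; k++)
        for (int j = 0; j < n2; j++)
          a[k * s3 + j] = 1.0f;
    }
  else
    {
      /* General path: the inner loop strides by s2.  */
      for (int k = 0; k < n3; k++)
        for (int j = 0; j < n2; j++)
          a[k * s3 + j * s2] = 1.0f;
    }
}
```

For a contiguous 4x3x2 array (s2 == 4, s3 == 12) only the elements with first index 1 are written, which is why the s2 == 1 version is dead weight at runtime.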

Other conditions are for things like:

  subroutine foo(a)
real :: a(:,:,:)
a(:,1,:) = ...
  end subroutine

where the pass really does help, but unfortunately those loops
aren't hot and might not even be used at all.

Some opportunities are cases in which we avoid gather loads, such as
in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
But mp feeds a gather load itself and these assignments aren't a
critical part of the loop.

(I'm testing on an AVX-only system, so these are synthesised gathers
rather than real gathers.)

The growth in 548.exchange2_r comes from reasonable changes to cold code.
The test spends almost all its time in __brute_force_MOD_digits_2, which
contains the same code before and after the patch, but which accounts
for less than 1% of .text before the patch.

> I've skimmed over the updated patch and it looks
> a lot better now.
>
> +bool
> +loop_versioning
> +::find_per_loop_multiplication (address_info &, address_term_info &)
> +{
>
> is that what coding convention allows?

Not sure we've settled on something, so I improvised.

> For grepping I'd then say we should do
>
> bool loop_versioning::
> find_per_loop_multiplication (...)
>
> ;)

Sounds good to me.

> Anywhere else you use
>
> loop_versioning::foo
>
> so please stick to that.

Yeah, I used that when I could avoid an argument spilling over 80 chars,
but I think I missed that the above function now fits into that category.
Will double-check the others.

> Otherwise OK.

Thanks :-)

> I think I don't see a testcase where we could version both loops in a nest
> so I'm not sure whether the transform works fine when you are only
> updating SSA form at the very end of the pass.

You mean something like:

  real :: foo(:,:), bar(:)

  do i=1,n
do j=1,n
  foo(i,j) = ...
end do
bar(i) = ..
  end do

?  I can add a test if so.

At the moment the pass only looks for direct versioning opportunities
in inner loops, so the assignment to bar wouldn't be treated as a
versioning opportunity.  We should still hoist the version check
for foo outside both loops, which is tested by loop_versioning_4.f90
for cases in which the outer loop doesn't have its own array
arithmetic, but isn't tested for cases like the above.
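As a hedged C sketch of that hoisting (assumed shapes and a flattened index, not the pass's literal output): the stride check for foo is emitted once, above both loops, even though only the inner loop accesses foo, and bar itself is not versioned.

```c
/* Hypothetical result of hoisting the version check outside a loop nest.
   The flattened indexing is an illustrative stand-in for the Fortran
   descriptor arithmetic.  */
static void nest(float *foo, float *bar, int n, long stride)
{
  if (stride == 1)                    /* single check hoisted above the nest */
    {
      for (int i = 0; i < n; i++)
        {
          for (int j = 0; j < n; j++)
            foo[i * n + j] = 1.0f;    /* inner accesses now contiguous */
          bar[i] = 2.0f;              /* bar is not itself versioned */
        }
    }
  else
    {
      for (int i = 0; i < n; i++)
        {
          for (int j = 0; j < n; j++)
            foo[(i * n + j) * stride] = 1.0f;
          bar[i] = 2.0f;
        }
    }
}
```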

> There may also be some subtle issues with substitute_and_fold being
> applied to non-up-to-date SSA form given it folds stmts looking at
> (single-use!) SSA edges.  The single-use guard might be what saves you
> here (SSA uses in the copies are not yet updated to point to the
> copied DEFs).

OK.  I was hoping that because we only apply substitute_and_fold
to new code, there would be no problem with uses elsewhere.

Would it be safer to:

  - version all loops we want to version
  - update SSA explicitly
  - apply substitute and fold to all "new" loops

?  Could we then get away with returning a 0 TODO at the end?

Thanks,
Richard


Re: Add a loop versioning pass

2018-12-12 Thread Richard Biener
On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford
 wrote:
>
> Richard Biener  writes:
> >> > The pass contains an awful lot of heuristics :/  Like last year
> >> > with the interchange pass I would suggest to rip most of it out
> >> > and first lay infrastructure with the cases you can positively
> >> > identify without applying heuristics or "hacks" like stripping
> >> > semantically required casts.  That makes it also clear which
> >> > testcases test which code-path.  That said, all the analyze
> >> > multiplications/plusses/factors stuff was extremely hard to review
> >> > and I have no overall picture why this is all so complicated or
> >> > necessary.
> >>
> >> I think the tests do cover most of this -- was glad to see that
> >> when Kyrill tried taking something out, one of the tests started
> >> to fail :-).  And the comments in the tests as well as the code
> >> are supposed to explain why this is desirable.  But I can try to
> >> spruce up the comments in the code if they're not detailed enough.
> >>
> >> The problem is that the motivating (Fortran) case can't be identified
> >> without applying heuristics.  All we see is a collection of adds and
> >> multiplies, with no semantic information to say which multiplies are for
> >> the inner dimension and which aren't.  I agree it would be nicer not
> >> to have them, which is why I'd originally suggested the IFN to provide
> >> information about dimensions.
> >
> > Sure, still a new pass having _all_ the heuristic built in with
> > 10s of testcases does not make it easy to review those heuristics...
>
> Fair :-)  In the patch below I've tried to cut down on the special cases
> and tried to add more comments explaining which patterns the code is
> trying to detect/reject.  Probably the key part is the one beginning:
>
> +   The main difficulty here isn't finding strides that could be used
> +   in a version check (that's pretty easy).  The problem instead is to
> +   avoid versioning for some stride S that is unlikely ever to be 1 at
> +   runtime.  Versioning for S == 1 on its own would lead to unnecessary
> +   code bloat, while adding S == 1 to more realistic version conditions
> +   would lose the optimisation opportunity offered by those other conditions.
>
> >> >> +/* Return true if in principle it is worth versioning an index 
> >> >> fragment of
> >> >> +   the form:
> >> >> +
> >> >> + (i * b * SCALE) / FACTOR
> >> >> +
> >> >> +   for the case in which b == 1.  */
> >> >> +
> >> >> +bool
> >> >> +loop_versioning::acceptable_scale_p (tree scale, poly_uint64 factor)
> >> >> +{
> >> >> +  /* See whether SCALE is a constant multiple of FACTOR, and if the
> >> >> + multiple is small enough for us to treat it as a potential grouped
> >> >> + access.  For example:
> >> >> +
> >> >> +   for (auto i : ...)
> >> >> +  y[i] = f (x[4 * i * stride],
> >> >> +x[4 * i * stride + 1],
> >> >> +x[4 * i * stride + 2]);
> >> >> +
> >> >> + would benefit from versioning for the case in which stride == 1.
> >> >> + High multiples of i * stride are less likely to benefit, and could
> >> >> + indicate a simulated multi-dimensional array.
> >> >> +
> >> >> + This is just a heuristic, to avoid having to do expensive group
> >> >> + analysis of the data references in a loop.  */
> >> >> +  poly_uint64 const_scale;
> >> >> +  unsigned int multiple;
> >> >> +  if (poly_int_tree_p (scale, &const_scale)
> >> >> +  && constant_multiple_p (const_scale, factor, &multiple))
> >> >> +{
> >> >> +  unsigned int maxval = PARAM_VALUE (PARAM_LOOP_VERSIONING_GROUP_SIZE);
> >> >> +  return IN_RANGE (multiple, 1, maxval);
> >> >
> >> > Hmm.  So you _do_ want to version sth like
> >> >
> >> > struct X { int i; int j; } a[2048];
> >> >
> >> > for (int i = start; i < end; ++i)
> >> >   a[i*s].i = 1;
> >> >
> >> > ?  That is, even with s == 1 the accesses will not become contiguous?
> >> > OK, I suppose that's because you are looking at a stmt in isolation
> >> > and another stmt may access a[i*s].j here.
> >> >
> >> > That is, would it be a future improvement to run sth like the
> >> > vectorizers group access analysis on the references and perform
> >> > this check on whole groups then, possibly better being able to
> >> > constrain what is now the magic parameter 
> >> > PARAM_LOOP_VERSIONING_GROUP_SIZE?
> >>
> >> Yeah, possibly.  The problem is that we might end up having to reproduce
> >> vectoriser heuristics.  E.g. stores with gaps should be vectorisable
> >> in future with SVE, using contiguous predicated stores in which some
> >> lanes are inactive.  So we don't necessarily need the .j access for
> >> this to be worthwhile.
> >>
> >> But perhaps the argument for versioning for vectorisation is stronger
> >> for grouped accesses than contiguous ones.
> >>
> >> Note that the above example doesn't rely on the grouping heuristic
> >> since we just strip the COMPONENT_REF and then analyze the ARRAY_REF
> >> with a 

Re: Add a loop versioning pass

2018-12-06 Thread Richard Sandiford
Richard Biener  writes:
>> > The pass contains an awful lot of heuristics :/  Like last year
>> > with the interchange pass I would suggest to rip most of it out
>> > and first lay infrastructure with the cases you can positively
>> > identify without applying heuristics or "hacks" like stripping
>> > semantically required casts.  That makes it also clear which
>> > testcases test which code-path.  That said, all the analyze
>> > multiplications/plusses/factors stuff was extremely hard to review
>> > and I have no overall picture why this is all so complicated or
>> > necessary.
>> 
>> I think the tests do cover most of this -- was glad to see that
>> when Kyrill tried taking something out, one of the tests started
>> to fail :-).  And the comments in the tests as well as the code
>> are supposed to explain why this is desirable.  But I can try to
>> spruce up the comments in the code if they're not detailed enough.
>> 
>> The problem is that the motivating (Fortran) case can't be identified
>> without applying heuristics.  All we see is a collection of adds and
>> multiplies, with no semantic information to say which multiplies are for
>> the inner dimension and which aren't.  I agree it would be nicer not
>> to have them, which is why I'd originally suggested the IFN to provide
>> information about dimensions.
>
> Sure, still a new pass having _all_ the heuristic built in with
> 10s of testcases does not make it easy to review those heuristics...

Fair :-)  In the patch below I've tried to cut down on the special cases
and tried to add more comments explaining which patterns the code is
trying to detect/reject.  Probably the key part is the one beginning:

+   The main difficulty here isn't finding strides that could be used
+   in a version check (that's pretty easy).  The problem instead is to
+   avoid versioning for some stride S that is unlikely ever to be 1 at
+   runtime.  Versioning for S == 1 on its own would lead to unnecessary
+   code bloat, while adding S == 1 to more realistic version conditions
+   would lose the optimisation opportunity offered by those other conditions.

>> >> +/* Return true if in principle it is worth versioning an index fragment 
>> >> of
>> >> +   the form:
>> >> +
>> >> + (i * b * SCALE) / FACTOR
>> >> +
>> >> +   for the case in which b == 1.  */
>> >> +
>> >> +bool
>> >> +loop_versioning::acceptable_scale_p (tree scale, poly_uint64 factor)
>> >> +{
>> >> +  /* See whether SCALE is a constant multiple of FACTOR, and if the
>> >> + multiple is small enough for us to treat it as a potential grouped
>> >> + access.  For example:
>> >> +
>> >> +   for (auto i : ...)
>> >> +  y[i] = f (x[4 * i * stride],
>> >> +x[4 * i * stride + 1],
>> >> +x[4 * i * stride + 2]);
>> >> +
>> >> + would benefit from versioning for the case in which stride == 1.
>> >> + High multiples of i * stride are less likely to benefit, and could
>> >> + indicate a simulated multi-dimensional array.
>> >> +
>> >> + This is just a heuristic, to avoid having to do expensive group
>> >> + analysis of the data references in a loop.  */
>> >> +  poly_uint64 const_scale;
>> >> +  unsigned int multiple;
>> >> +  if (poly_int_tree_p (scale, &const_scale)
>> >> +  && constant_multiple_p (const_scale, factor, &multiple))
>> >> +{
>> >> +  unsigned int maxval = PARAM_VALUE (PARAM_LOOP_VERSIONING_GROUP_SIZE);
>> >> +  return IN_RANGE (multiple, 1, maxval);
>> >
>> > Hmm.  So you _do_ want to version sth like
>> >
>> > struct X { int i; int j; } a[2048];
>> >
>> > for (int i = start; i < end; ++i)
>> >   a[i*s].i = 1;
>> >
>> > ?  That is, even with s == 1 the accesses will not become contiguous?
>> > OK, I suppose that's because you are looking at a stmt in isolation
>> > and another stmt may access a[i*s].j here.
>> >
>> > That is, would it be a future improvement to run sth like the
>> > vectorizers group access analysis on the references and perform
>> > this check on whole groups then, possibly better being able to
>> > constrain what is now the magic parameter PARAM_LOOP_VERSIONING_GROUP_SIZE?
>> 
>> Yeah, possibly.  The problem is that we might end up having to reproduce
>> vectoriser heuristics.  E.g. stores with gaps should be vectorisable
>> in future with SVE, using contiguous predicated stores in which some
>> lanes are inactive.  So we don't necessarily need the .j access for
>> this to be worthwhile.
>> 
>> But perhaps the argument for versioning for vectorisation is stronger
>> for grouped accesses than contiguous ones.
>> 
>> Note that the above example doesn't rely on the grouping heuristic
>> since we just strip the COMPONENT_REF and then analyze the ARRAY_REF
>> with a factor of 1.  The heuristic instead handles things like:
>> 
>>   a[i * stride * 2]
>> 
>> etc., and the param is more there to decide when a constant factor is
>> small enough to be a realistic group size instead of an outer dimension.
>> E.g. if we see:
>> 
>>  

Re: Add a loop versioning pass

2018-12-03 Thread Richard Biener
On Wed, 28 Nov 2018, Richard Sandiford wrote:

> Thanks for the review and sorry for the long time replying.

Likewise ...

> Richard Biener  writes:
> >> This patch adds a pass that versions loops with variable index strides
> >> for the case in which the stride is 1.  E.g.:
> >> 
> >> for (int i = 0; i < n; ++i)
> >>   x[i * stride] = ...;
> >> 
> >> becomes:
> >> 
> >> if (stepx == 1)
> >>   for (int i = 0; i < n; ++i)
> >> x[i] = ...;
> >> else
> >>   for (int i = 0; i < n; ++i)
> >> x[i * stride] = ...;
> >> 
> >> This is useful for both vector code and scalar code, and in some cases
> >> can enable further optimisations like loop interchange or pattern
> >> recognition.
> >> 
> >> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
> >> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
> >> that regress.
> >> 
> >> Sizewise, there's a 10% increase in .text for both 554.roms_r and
> >> 465.tonto.  That's obviously a lot, but in tonto's case it's because
> >> the whole program is written using assumed-shape arrays and pointers,
> >> so a large number of functions really do benefit from versioning.
> >> roms likewise makes heavy use of assumed-shape arrays, and that
> >> improvement in performance IMO justifies the code growth.
> >
> > Ouch.  I know that at least with LTO IPA-CP can do "quite" some
> > propagation of constant strides.  Not sure if we're aggressive
> > enough in actually doing the cloning for all cases we figure out
> > strides though.  But my question is how we can avoid doing the
> > versioning for loops in the copy that did not have the IPA-CPed
> > stride of one?  Ideally we'd be able to mark individual references
> > as {definitely,likely,unlikely,not}-unit-stride?
> 
> This is a more elaborate version of what I was trying to do with
> the IFN I'd mentioned on IRC a few weeks back.  It would be a bit
> like a long-living ASSERT_EXPR that provides hints about the value
> of the SSA name.  Another alternative would be to annotate the
> SSA name itself, but then we'd easily lose the information during
> optimisation.
> 
> But will IPA-CP conditionally use a clone for a stride of 1 for
> calls that pass a variable stride that might or might not be 1?
> E.g. if it clones foo as foo.1 for calls in which a stride parameter
> is 1 at compile time, does it also make foo call foo.1 when the stride
> parameter is 1 at runtime?  If not, that sounds like a missed optimisation.
> If so, prune_conditions should stop us from versioning.

Martin answered that.

> >> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
> >> a small (0.4%) speed improvement there, but although both 3-iteration runs
> >> produced stable results, that might still be noise.  There was a slightly
> >> larger (non-noise) improvement for a 256-bit SVE model.
> >> 
> >> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
> >> without any noticeable improvement in performance.  No other test grew
> >> by more than 2%.
> >> 
> >> Although the main SPEC beneficiaries are all Fortran tests, the
> >> benchmarks we use for SVE also include some C and C++ tests that
> >> benefit.
> >
> > Did you see any slowdown, for example because versioning was forced
> > to be on an innermost loop?  I'm thinking of the testcase in
> > PR87561 where we do have strided accesses in the innermost loop.
> 
> Not so far.
> 
> > Since you cite performance numbers how did you measure them?
> > I assume -Ofast -march=native but did you check with -flto?
> 
> -O3 -march=native.  No, didn't check LTO, will do that.
> 
> >> Using -frepack-arrays gives the same benefits in many Fortran cases.
> >> The problem is that using that option inappropriately can force a full
> >> array copy for arguments that the function only reads once, and so it
> >> isn't really something we can turn on by default.  The new pass is
> >> supposed to give most of the benefits of -frepack-arrays without
> >> the risk of unnecessary repacking.
> >> 
> >> The patch therefore enables the pass by default at -O3.
> >
> > I think that's reasonable.
> >
> > One possible enhancement would be to add a value-profile for the
> > strides so we can guide this optimization better.
> >
> > The pass falls foul of the C++ style of making small methods out of
> > everything.  That makes following the code very hard.  Please inline
> > single-use methods into callers wherever possible to make the code read
> > more like GCC code (using the GCC API).
> 
> I'll bite my tongue on this one :-)
> 
> > The pass contains an awful lot of heuristics :/  Like last year
> > with the interchange pass, I would suggest ripping most of it out
> > and first laying infrastructure with the cases you can positively
> > identify without applying heuristics or "hacks" like stripping
> > semantically required casts.  That also makes it clear which
> > testcases test which code path.  That said, all the analyze
> > 

Re: Add a loop versioning pass

2018-11-29 Thread Martin Jambor
Hi,

On Wed, Nov 28 2018, Richard Sandiford wrote:
>>> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
>>> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
>>> that regress.
>>> 
>>> Sizewise, there's a 10% increase in .text for both 554.roms_r and
>>> 465.tonto.  That's obviously a lot, but in tonto's case it's because
>>> the whole program is written using assumed-shape arrays and pointers,
>>> so a large number of functions really do benefit from versioning.
>>> roms likewise makes heavy use of assumed-shape arrays, and that
>>> improvement in performance IMO justifies the code growth.
>>
>> Ouch.  I know that at least with LTO IPA-CP can do "quite" some
>> propagation of constant strides.  Not sure if we're aggressive
>> enough in actually doing the cloning for all cases we figure out
>> strides though.  But my question is how we can avoid doing the
>> versioning for loops in the copy that did not have the IPA-CPed
>> stride of one?  Ideally we'd be able to mark individual references
>> as {definitely,likely,unlikely,not}-unit-stride?
>
> This is a more elaborate version of what I was trying to do with
> the IFN I'd mentioned on IRC a few weeks back.  It would be a bit
> like a long-living ASSERT_EXPR that provides hints about the value
> of the SSA name.  Another alternative would be to annotate the
> SSA name itself, but then we'd easily lose the information during
> optimisation.
>
> But will IPA-CP conditionally use a clone for a stride of 1 for
> calls that pass a variable stride that might or might not be 1?
> E.g. if it clones foo as foo.1 for calls in which a stride parameter
> is 1 at compile time, does it also make foo call foo.1 when the stride
> parameter is 1 at runtime?  If not, that sounds like a missed optimisation.
> If so, prune_conditions should stop us from versioning.

IPA-CP only creates clones for compile-time constants; it does not add
any run-time check to the callers.  It is an interesting idea that it
could add these for some cases when it decides to do the cloning
anyway.
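As a concrete (and entirely hypothetical) C sketch of the distinction: IPA-CP today would redirect only calls whose stride argument is the literal constant 1 to a clone like `scale_stride1` below; the run-time dispatch in `scale_maybe_stride1` is the part it does not currently emit.  All function names here are made up for illustration.

```c
#include <stddef.h>

/* General version: variable stride.  */
void scale (double *x, size_t n, size_t stride, double c)
{
  for (size_t i = 0; i < n; ++i)
    x[i * stride] *= c;
}

/* What an IPA-CP clone specialised for stride == 1 looks like; today it
   is only reached from calls where the stride is a compile-time 1.  */
void scale_stride1 (double *x, size_t n, double c)
{
  for (size_t i = 0; i < n; ++i)
    x[i] *= c;
}

/* The hypothetical extra step under discussion: guard the clone with a
   run-time check so variable-stride callers can reach it too.  */
void scale_maybe_stride1 (double *x, size_t n, size_t stride, double c)
{
  if (stride == 1)
    scale_stride1 (x, n, c);   /* specialised fast path */
  else
    scale (x, n, stride, c);   /* general fallback */
}
```

Both paths compute the same result; the question in the thread is whether emitting such a dispatch would make the separate loop-versioning pass redundant for IPA-CPed copies.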

Martin


Re: Add a loop versioning pass

2018-11-28 Thread Richard Sandiford
Thanks for the review and sorry for the long time replying.

Richard Biener  writes:
>> This patch adds a pass that versions loops with variable index strides
>> for the case in which the stride is 1.  E.g.:
>> 
>> for (int i = 0; i < n; ++i)
>>   x[i * stride] = ...;
>> 
>> becomes:
>> 
>> if (stepx == 1)
>>   for (int i = 0; i < n; ++i)
>> x[i] = ...;
>> else
>>   for (int i = 0; i < n; ++i)
>> x[i * stride] = ...;
>> 
>> This is useful for both vector code and scalar code, and in some cases
>> can enable further optimisations like loop interchange or pattern
>> recognition.
>> 
>> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
>> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
>> that regress.
>> 
>> Sizewise, there's a 10% increase in .text for both 554.roms_r and
>> 465.tonto.  That's obviously a lot, but in tonto's case it's because
>> the whole program is written using assumed-shape arrays and pointers,
>> so a large number of functions really do benefit from versioning.
>> roms likewise makes heavy use of assumed-shape arrays, and that
>> improvement in performance IMO justifies the code growth.
>
> Ouch.  I know that at least with LTO IPA-CP can do "quite" some
> propagation of constant strides.  Not sure if we're aggressive
> enough in actually doing the cloning for all cases we figure out
> strides though.  But my question is how we can avoid doing the
> versioning for loops in the copy that did not have the IPA-CPed
> stride of one?  Ideally we'd be able to mark individual references
> as {definitely,likely,unlikely,not}-unit-stride?

This is a more elaborate version of what I was trying to do with
the IFN I'd mentioned on IRC a few weeks back.  It would be a bit
like a long-living ASSERT_EXPR that provides hints about the value
of the SSA name.  Another alternative would be to annotate the
SSA name itself, but then we'd easily lose the information during
optimisation.

But will IPA-CP conditionally use a clone for a stride of 1 for
calls that pass a variable stride that might or might not be 1?
E.g. if it clones foo as foo.1 for calls in which a stride parameter
is 1 at compile time, does it also make foo call foo.1 when the stride
parameter is 1 at runtime?  If not, that sounds like a missed optimisation.
If so, prune_conditions should stop us from versioning.

>> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
>> a small (0.4%) speed improvement there, but although both 3-iteration runs
>> produced stable results, that might still be noise.  There was a slightly
>> larger (non-noise) improvement for a 256-bit SVE model.
>> 
>> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
>> without any noticeable improvement in performance.  No other test grew
>> by more than 2%.
>> 
>> Although the main SPEC beneficiaries are all Fortran tests, the
>> benchmarks we use for SVE also include some C and C++ tests that
>> benefit.
>
> Did you see any slowdown, for example because versioning was forced
> to be on an innermost loop?  I'm thinking of the testcase in
> PR87561 where we do have strided accesses in the innermost loop.

Not so far.

> Since you cite performance numbers how did you measure them?
> I assume -Ofast -march=native but did you check with -flto?

-O3 -march=native.  No, didn't check LTO, will do that.

>> Using -frepack-arrays gives the same benefits in many Fortran cases.
>> The problem is that using that option inappropriately can force a full
>> array copy for arguments that the function only reads once, and so it
>> isn't really something we can turn on by default.  The new pass is
>> supposed to give most of the benefits of -frepack-arrays without
>> the risk of unnecessary repacking.
>> 
>> The patch therefore enables the pass by default at -O3.
>
> I think that's reasonable.
>
> One possible enhancement would be to add a value-profile for the
> strides so we can guide this optimization better.
>
> The pass falls foul of the C++ style of making small methods out of
> everything.  That makes following the code very hard.  Please inline
> single-use methods into callers wherever possible to make the code read
> more like GCC code (using the GCC API).

I'll bite my tongue on this one :-)

> The pass contains an awful lot of heuristics :/  Like last year
> with the interchange pass, I would suggest ripping most of it out
> and first laying infrastructure with the cases you can positively
> identify without applying heuristics or "hacks" like stripping
> semantically required casts.  That also makes it clear which
> testcases test which code path.  That said, all the analyze
> multiplications/plusses/factors stuff was extremely hard to review
> and I have no overall picture of why this is all so complicated or
> necessary.

I think the tests do cover most of this -- was glad to see that
when Kyrill tried taking something out, one of the tests started
to fail :-).  And the comments in 

Re: Add a loop versioning pass

2018-11-28 Thread Richard Sandiford
David Malcolm  writes:
> On Wed, 2018-10-24 at 14:05 +0100, Richard Sandiford wrote:
>> This patch adds a pass that versions loops with variable index
>> strides
>> for the case in which the stride is 1.  E.g.:
>> 
>> for (int i = 0; i < n; ++i)
>>   x[i * stride] = ...;
>> 
>> becomes:
>> 
>> if (stepx == 1)
>>   for (int i = 0; i < n; ++i)
>> x[i] = ...;
>> else
>>   for (int i = 0; i < n; ++i)
>> x[i * stride] = ...;
>> 
>> This is useful for both vector code and scalar code, and in some
>> cases
>> can enable further optimisations like loop interchange or pattern
>> recognition.
>> 
>> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
>> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
>> that regress.
>> 
>> Sizewise, there's a 10% increase in .text for both 554.roms_r and
>> 465.tonto.  That's obviously a lot, but in tonto's case it's because
>> the whole program is written using assumed-shape arrays and pointers,
>> so a large number of functions really do benefit from versioning.
>> roms likewise makes heavy use of assumed-shape arrays, and that
>> improvement in performance IMO justifies the code growth.
>> 
>> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did
>> see
>> a small (0.4%) speed improvement there, but although both 3-iteration 
>> runs
>> produced stable results, that might still be noise.  There was a
>> slightly
>> larger (non-noise) improvement for a 256-bit SVE model.
>> 
>> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
>> without any noticeable improvement in performance.  No other test
>> grew
>> by more than 2%.
>> 
>> Although the main SPEC beneficiaries are all Fortran tests, the
>> benchmarks we use for SVE also include some C and C++ tests that
>> benefit.
>> 
>> Using -frepack-arrays gives the same benefits in many Fortran cases.
>> The problem is that using that option inappropriately can force a
>> full
>> array copy for arguments that the function only reads once, and so it
>> isn't really something we can turn on by default.  The new pass is
>> supposed to give most of the benefits of -frepack-arrays without
>> the risk of unnecessary repacking.
>> 
>> The patch therefore enables the pass by default at -O3.
>> 
>> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>> 
>> Richard
>> 
>
> [...snip...]
>
>> +/* Run the pass and return a set of TODO_* flags.  */
>> +
>> +unsigned int
>> +loop_versioning::run ()
>> +{
>> +  gcc_assert (scev_initialized_p ());
>> +
>> +  if (!analyze_blocks ()
>> +  || !prune_conditions ()
>> +  || !make_versioning_decisions ()
>> +  || !implement_versioning_decisions ())
>> +return 0;
>> +
>> +  return TODO_update_ssa;
>> +}
>> +
>> +/* Loop versioningting pass.  */
>
> (typo)

Huh, no idea how I even got there.

>> +
>> +namespace {
>
> Could the whole file be within this anonymous namespace, rather than
> just the opt_pass subclass?  (hiding class loop_versioning, so that the
> optimizer knows that the only thing visible outside the TU is
> make_pass_loop_versioning).  This can be a pain to debug, but you can
> always comment out the anon namespace locally when debugging.

Yeah, I prefer that style too, so that the TU only exports public
interfaces.  I'd got the impression from earlier threads that this
was frowned on for GCC and so I was deliberately avoiding it, but if
it's OK then great.

Thanks,
Richard


Re: Add a loop versioning pass

2018-10-30 Thread Richard Biener


(sorry for breaking threading -- I composed a review mail offline, but
gmail has no way of nicely sending that, nor does it have a way to
bounce messages...)

> This patch adds a pass that versions loops with variable index strides
> for the case in which the stride is 1.  E.g.:
> 
> for (int i = 0; i < n; ++i)
>   x[i * stride] = ...;
> 
> becomes:
> 
> if (stepx == 1)
>   for (int i = 0; i < n; ++i)
> x[i] = ...;
> else
>   for (int i = 0; i < n; ++i)
> x[i * stride] = ...;
> 
> This is useful for both vector code and scalar code, and in some cases
> can enable further optimisations like loop interchange or pattern
> recognition.
> 
> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
> that regress.
> 
> Sizewise, there's a 10% increase in .text for both 554.roms_r and
> 465.tonto.  That's obviously a lot, but in tonto's case it's because
> the whole program is written using assumed-shape arrays and pointers,
> so a large number of functions really do benefit from versioning.
> roms likewise makes heavy use of assumed-shape arrays, and that
> improvement in performance IMO justifies the code growth.

Ouch.  I know that at least with LTO IPA-CP can do "quite" some
propagation of constant strides.  Not sure if we're aggressive
enough in actually doing the cloning for all cases we figure out
strides though.  But my question is how we can avoid doing the
versioning for loops in the copy that did not have the IPA-CPed
stride of one?  Ideally we'd be able to mark individual references
as {definitely,likely,unlikely,not}-unit-stride?

> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
> a small (0.4%) speed improvement there, but although both 3-iteration runs
> produced stable results, that might still be noise.  There was a slightly
> larger (non-noise) improvement for a 256-bit SVE model.
> 
> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
> without any noticeable improvement in performance.  No other test grew
> by more than 2%.
> 
> Although the main SPEC beneficiaries are all Fortran tests, the
> benchmarks we use for SVE also include some C and C++ tests that
> benefit.

Did you see any slowdown, for example because versioning was forced
to be on an innermost loop?  I'm thinking of the testcase in
PR87561 where we do have strided accesses in the innermost loop.
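For reference, here is the transform under discussion applied by hand in plain C (a sketch only; the pass itself works on GIMPLE, and these function names are made up).  The point of the `if (stride == 1)` version is that the fast copy uses simple unit-stride addressing, while the fallback keeps the original semantics:

```c
/* Original loop: variable index stride.  */
void set_orig (int *x, int n, int stride)
{
  for (int i = 0; i < n; ++i)
    x[i * stride] = i;
}

/* Hand-versioned equivalent: a unit-stride fast path plus the
   unchanged general loop as a fallback.  */
void set_versioned (int *x, int n, int stride)
{
  if (stride == 1)
    for (int i = 0; i < n; ++i)
      x[i] = i;          /* cheap addressing, easy to vectorise */
  else
    for (int i = 0; i < n; ++i)
      x[i * stride] = i; /* original strided form */
}
```

When the stride is never 1 at run time (the PR87561 situation), only the fallback executes, and the version test plus the duplicated loop body are pure overhead.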

Since you cite performance numbers how did you measure them?
I assume -Ofast -march=native but did you check with -flto?

> Using -frepack-arrays gives the same benefits in many Fortran cases.
> The problem is that using that option inappropriately can force a full
> array copy for arguments that the function only reads once, and so it
> isn't really something we can turn on by default.  The new pass is
> supposed to give most of the benefits of -frepack-arrays without
> the risk of unnecessary repacking.
> 
> The patch therefore enables the pass by default at -O3.

I think that's reasonable.

One possible enhancement would be to add a value-profile for the
strides so we can guide this optimization better.

The pass falls foul of the C++ style of making small methods out of
everything.  That makes following the code very hard.  Please inline
single-use methods into callers wherever possible to make the code read
more like GCC code (using the GCC API).

The pass contains an awful lot of heuristics :/  Like last year
with the interchange pass, I would suggest ripping most of it out
and first laying infrastructure with the cases you can positively
identify without applying heuristics or "hacks" like stripping
semantically required casts.  That also makes it clear which
testcases test which code path.  That said, all the analyze
multiplications/plusses/factors stuff was extremely hard to review
and I have no overall picture of why this is all so complicated or
necessary.

> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
> 
> Richard
> 
> 
> 2018-10-24  Richard Sandiford  
> 
> gcc/
>   * doc/invoke.texi (-fversion-loops-for-strides): Document
>   (loop-versioning-group-size, loop-versioning-max-inner-insns)
>   (loop-versioning-max-outer-insns): Document new --params.
>   * Makefile.in (OBJS): Add gimple-loop-versioning.o.
>   * common.opt (fversion-loops-for-strides): New option.
>   * opts.c (default_options_table): Enable fversion-loops-for-strides
>   at -O3.
>   * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
>   (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
>   (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
>   * passes.def: Add pass_loop_versioning.
>   * timevar.def (TV_LOOP_VERSIONING): New time variable.
>   * tree-ssa-propagate.h
>   (substitute_and_fold_engine::substitute_and_fold): Add an optional
>   block parameter.
>   * tree-ssa-propagate.c
>   (substitute_and_fold_engine::substitute_and_fold): 

Re: Add a loop versioning pass

2018-10-26 Thread Richard Biener
On Thu, Oct 25, 2018 at 3:52 PM Richard Biener
 wrote:
>
> On Wed, Oct 24, 2018 at 3:05 PM Richard Sandiford
>  wrote:
> >
> > This patch adds a pass that versions loops with variable index strides
> > for the case in which the stride is 1.  E.g.:
> >
> > for (int i = 0; i < n; ++i)
> >   x[i * stride] = ...;
> >
> > becomes:
> >
> > if (stepx == 1)
> >   for (int i = 0; i < n; ++i)
> > x[i] = ...;
> > else
> >   for (int i = 0; i < n; ++i)
> > x[i * stride] = ...;
> >
> > This is useful for both vector code and scalar code, and in some cases
> > can enable further optimisations like loop interchange or pattern
> > recognition.
> >
> > The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
> > and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
> > that regress.
> >
> > Sizewise, there's a 10% increase in .text for both 554.roms_r and
> > 465.tonto.  That's obviously a lot, but in tonto's case it's because
> > the whole program is written using assumed-shape arrays and pointers,
> > so a large number of functions really do benefit from versioning.
> > roms likewise makes heavy use of assumed-shape arrays, and that
> > improvement in performance IMO justifies the code growth.
> >
> > The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
> > a small (0.4%) speed improvement there, but although both 3-iteration runs
> > produced stable results, that might still be noise.  There was a slightly
> > larger (non-noise) improvement for a 256-bit SVE model.
> >
> > 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
> > without any noticeable improvement in performance.  No other test grew
> > by more than 2%.
> >
> > Although the main SPEC beneficiaries are all Fortran tests, the
> > benchmarks we use for SVE also include some C and C++ tests that
> > benefit.
> >
> > Using -frepack-arrays gives the same benefits in many Fortran cases.
> > The problem is that using that option inappropriately can force a full
> > array copy for arguments that the function only reads once, and so it
> > isn't really something we can turn on by default.  The new pass is
> > supposed to give most of the benefits of -frepack-arrays without
> > the risk of unnecessary repacking.
> >
> > The patch therefore enables the pass by default at -O3.
> >
> > Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>
> I will give this a thorough review tomorrow (sorry for the delay),

So I didn't really finish, but one thing I noticed is:

> +  if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> +   fprintf (dump_file, ";; Want to version loop %d (depth %d)"
> +" for when ", loop->num, loop_depth (loop));
> +   print_generic_expr (dump_file, name, TDF_SLIM);
> +   fprintf (dump_file, " == 1");

Since you are writing a new pass you want to use the new dump interface.

  if (dump_enabled_p ())
    dump_printf (MSG_NOTE, ";; Want to version loop %d (depth %d)"
                 " for when %E == 1", loop->num, loop_depth (loop), name);
...

it's much nicer to be able to use %E/%G than separate calls for the
tree parts.

I've also included my overall comment part, but not the incomplete
set of individual comments/questions yet (hope to finish that on Monday):

> Sizewise, there's a 10% increase in .text for both 554.roms_r and
> 465.tonto.  That's obviously a lot, but in tonto's case it's because
> the whole program is written using assumed-shape arrays and pointers,
> so a large number of functions really do benefit from versioning.
> roms likewise makes heavy use of assumed-shape arrays, and that
> improvement in performance IMO justifies the code growth.

Ouch.  I know that at least with LTO IPA-CP can do "quite" some
propagation of constant strides.  Not sure if we're aggressive
enough in actually doing the cloning for all cases we figure out
strides though.  But my question is how we can avoid doing the
versioning for loops in the copy that did not have the IPA-CPed
stride of one?  Ideally we'd be able to mark individual references
as {definitely,likely,unlikely,not}-unit-stride?

> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
> a small (0.4%) speed improvement there, but although both 3-iteration runs
> produced stable results, that might still be noise.  There was a slightly
> larger (non-noise) improvement for a 256-bit SVE model.
>
> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
> without any noticeable improvement in performance.  No other test grew
> by more than 2%.
>
> Although the main SPEC beneficiaries are all Fortran tests, the
> benchmarks we use for SVE also include some C and C++ tests that
> benefit.

Did you see any slowdown, for example because versioning was forced
to be on an innermost loop?  I'm thinking of the testcase in
PR87561 where we do have strided accesses in the innermost loop.

Since you cite performance numbers how did 

Re: Add a loop versioning pass

2018-10-25 Thread David Malcolm
On Wed, 2018-10-24 at 14:05 +0100, Richard Sandiford wrote:
> This patch adds a pass that versions loops with variable index
> strides
> for the case in which the stride is 1.  E.g.:
> 
> for (int i = 0; i < n; ++i)
>   x[i * stride] = ...;
> 
> becomes:
> 
> if (stepx == 1)
>   for (int i = 0; i < n; ++i)
> x[i] = ...;
> else
>   for (int i = 0; i < n; ++i)
> x[i * stride] = ...;
> 
> This is useful for both vector code and scalar code, and in some
> cases
> can enable further optimisations like loop interchange or pattern
> recognition.
> 
> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
> that regress.
> 
> Sizewise, there's a 10% increase in .text for both 554.roms_r and
> 465.tonto.  That's obviously a lot, but in tonto's case it's because
> the whole program is written using assumed-shape arrays and pointers,
> so a large number of functions really do benefit from versioning.
> roms likewise makes heavy use of assumed-shape arrays, and that
> improvement in performance IMO justifies the code growth.
> 
> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did
> see
> a small (0.4%) speed improvement there, but although both 3-iteration 
> runs
> produced stable results, that might still be noise.  There was a
> slightly
> larger (non-noise) improvement for a 256-bit SVE model.
> 
> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
> without any noticeable improvement in performance.  No other test
> grew
> by more than 2%.
> 
> Although the main SPEC beneficiaries are all Fortran tests, the
> benchmarks we use for SVE also include some C and C++ tests that
> benefit.
> 
> Using -frepack-arrays gives the same benefits in many Fortran cases.
> The problem is that using that option inappropriately can force a
> full
> array copy for arguments that the function only reads once, and so it
> isn't really something we can turn on by default.  The new pass is
> supposed to give most of the benefits of -frepack-arrays without
> the risk of unnecessary repacking.
> 
> The patch therefore enables the pass by default at -O3.
> 
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
> 
> Richard
> 

[...snip...]

> +/* Run the pass and return a set of TODO_* flags.  */
> +
> +unsigned int
> +loop_versioning::run ()
> +{
> +  gcc_assert (scev_initialized_p ());
> +
> +  if (!analyze_blocks ()
> +  || !prune_conditions ()
> +  || !make_versioning_decisions ()
> +  || !implement_versioning_decisions ())
> +return 0;
> +
> +  return TODO_update_ssa;
> +}
> +
> +/* Loop versioningting pass.  */

(typo)

> +
> +namespace {

Could the whole file be within this anonymous namespace, rather than
just the opt_pass subclass?  (hiding class loop_versioning, so that the
optimizer knows that the only thing visible outside the TU is
make_pass_loop_versioning).  This can be a pain to debug, but you can
always comment out the anon namespace locally when debugging.

> +
> +const pass_data pass_data_loop_versioning =
> +{
> +  GIMPLE_PASS, /* type */
> +  "lversion", /* name */
> +  OPTGROUP_LOOP, /* optinfo_flags */
> +  TV_LOOP_VERSIONING, /* tv_id */
> +  PROP_cfg, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
[...snip...]

> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_loop_versioning (gcc::context *ctxt)
> +{
> +  return new pass_loop_versioning (ctxt);
> +}

[...snip...]



Re: Add a loop versioning pass

2018-10-25 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, Oct 24, 2018 at 3:05 PM Richard Sandiford
>  wrote:
>>
>> This patch adds a pass that versions loops with variable index strides
>> for the case in which the stride is 1.  E.g.:
>>
>> for (int i = 0; i < n; ++i)
>>   x[i * stride] = ...;
>>
>> becomes:
>>
>> if (stepx == 1)
>>   for (int i = 0; i < n; ++i)
>> x[i] = ...;
>> else
>>   for (int i = 0; i < n; ++i)
>> x[i * stride] = ...;
>>
>> This is useful for both vector code and scalar code, and in some cases
>> can enable further optimisations like loop interchange or pattern
>> recognition.
>>
>> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
>> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
>> that regress.
>>
>> Sizewise, there's a 10% increase in .text for both 554.roms_r and
>> 465.tonto.  That's obviously a lot, but in tonto's case it's because
>> the whole program is written using assumed-shape arrays and pointers,
>> so a large number of functions really do benefit from versioning.
>> roms likewise makes heavy use of assumed-shape arrays, and that
>> improvement in performance IMO justifies the code growth.
>>
>> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
>> a small (0.4%) speed improvement there, but although both 3-iteration runs
>> produced stable results, that might still be noise.  There was a slightly
>> larger (non-noise) improvement for a 256-bit SVE model.
>>
>> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
>> without any noticeable improvement in performance.  No other test grew
>> by more than 2%.
>>
>> Although the main SPEC beneficiaries are all Fortran tests, the
>> benchmarks we use for SVE also include some C and C++ tests that
>> benefit.
>>
>> Using -frepack-arrays gives the same benefits in many Fortran cases.
>> The problem is that using that option inappropriately can force a full
>> array copy for arguments that the function only reads once, and so it
>> isn't really something we can turn on by default.  The new pass is
>> supposed to give most of the benefits of -frepack-arrays without
>> the risk of unnecessary repacking.
>>
>> The patch therefore enables the pass by default at -O3.
>>
>> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>
> I will give this a thorough review tomorrow (sorry for the delay), but
> one question that comes up is how we avoid excessive versioning.
> Consider this pass versioning for stride1, then loop distribution
> versioning for aliasing and then vectorization versioning for alignment.

Yeah, this could happen. :-)

> Do we have any idea on how to track these versionings so we can
> eventually commonize paths that didn't get any followup optimization
> applied?

The idea was that this pass would help even for scalar code (due to
cheaper addressing), so it's not something we should necessarily undo
if there is no follow-on optimisation.

> Also consider the case where loop distribution applies transforms
> with versioning for aliasing to _both_ loop nests from your versioning.

I think that's a good thing though.  The aliasing checks are going to be
cheaper for a stride of 1 compared to a variable stride.
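A hand-written sketch of why (the real checks loop distribution emits are more involved, and this assumes a positive stride): with a unit stride the conservative overlap test is two pointer comparisons on fixed-extent ranges, while with a variable stride the extent of each accessed range must itself be computed at run time first.

```c
#include <stdbool.h>

/* Conservative overlap test for x[0..n-1] vs y[0..n-1] with unit
   stride: two comparisons on ranges of known extent.  */
bool overlap_unit (const double *x, const double *y, long n)
{
  return x < y + n && y < x + n;
}

/* With a variable (positive) stride the spanned extent is
   (n - 1) * stride + 1 elements, so the bounds must be computed
   before the same range test can run.  Still conservative:
   interleaved accesses that never collide count as overlapping.  */
bool overlap_strided (const double *x, const double *y,
                      long n, long stride)
{
  long ext = (n - 1) * stride + 1;
  return x < y + ext && y < x + ext;
}
```

After versioning, the fast loop copy only needs the `overlap_unit` form of the check, which is both shorter and easier for later passes to fold.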

> Maybe we can at least add some loop->number_of_times_versioned
> counter and some dumping/statistics to get an idea how big an
> issue this might be?

That might help.  If we could be more confident that the "faster"
loop versions are very likely to be used, we could throttle the
optimisation of the unlikely versions.

But so far I haven't seen an example where this really hurts.

Thanks,
Richard

>
> Thanks,
> Richard.
>
>> Richard
>>
>>
>> 2018-10-24  Richard Sandiford  
>>
>> gcc/
>> * doc/invoke.texi (-fversion-loops-for-strides): Document
>> (loop-versioning-group-size, loop-versioning-max-inner-insns)
>> (loop-versioning-max-outer-insns): Document new --params.
>> * Makefile.in (OBJS): Add gimple-loop-versioning.o.
>> * common.opt (fversion-loops-for-strides): New option.
>> * opts.c (default_options_table): Enable fversion-loops-for-strides
>> at -O3.
>> * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
>> (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
>> (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
>> * passes.def: Add pass_loop_versioning.
>> * timevar.def (TV_LOOP_VERSIONING): New time variable.
>> * tree-ssa-propagate.h
>> (substitute_and_fold_engine::substitute_and_fold): Add an optional
>> block parameter.
>> * tree-ssa-propagate.c
>> (substitute_and_fold_engine::substitute_and_fold): Likewise.
>> When passed, only walk blocks dominated by that block.
>> * tree-vrp.h (range_includes_p): Declare.
>> (range_includes_zero_p): Turn into an inline wrapper around
>> range_includes_p.
>> * tree-vrp.c (range_includes_p): 

Re: Add a loop versioning pass

2018-10-25 Thread Richard Biener
On Wed, Oct 24, 2018 at 3:05 PM Richard Sandiford
 wrote:
>
> This patch adds a pass that versions loops with variable index strides
> for the case in which the stride is 1.  E.g.:
>
> for (int i = 0; i < n; ++i)
>   x[i * stride] = ...;
>
> becomes:
>
> if (stepx == 1)
>   for (int i = 0; i < n; ++i)
> x[i] = ...;
> else
>   for (int i = 0; i < n; ++i)
> x[i * stride] = ...;
>
> This is useful for both vector code and scalar code, and in some cases
> can enable further optimisations like loop interchange or pattern
> recognition.
>
> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
> that regress.
>
> Sizewise, there's a 10% increase in .text for both 554.roms_r and
> 465.tonto.  That's obviously a lot, but in tonto's case it's because
> the whole program is written using assumed-shape arrays and pointers,
> so a large number of functions really do benefit from versioning.
> roms likewise makes heavy use of assumed-shape arrays, and that
> improvement in performance IMO justifies the code growth.
>
> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
> a small (0.4%) speed improvement there, but although both 3-iteration runs
> produced stable results, that might still be noise.  There was a slightly
> larger (non-noise) improvement for a 256-bit SVE model.
>
> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
> without any noticeable improvement in performance.  No other test grew
> by more than 2%.
>
> Although the main SPEC beneficiaries are all Fortran tests, the
> benchmarks we use for SVE also include some C and C++ tests that
> benefit.
>
> Using -frepack-arrays gives the same benefits in many Fortran cases.
> The problem is that using that option inappropriately can force a full
> array copy for arguments that the function only reads once, and so it
> isn't really something we can turn on by default.  The new pass is
> supposed to give most of the benefits of -frepack-arrays without
> the risk of unnecessary repacking.
>
> The patch therefore enables the pass by default at -O3.
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

I will give this a thorough review tomorrow (sorry for the delay), but
one question that comes up is how we avoid excessive versioning.
Consider this pass versioning for stride 1, then loop distribution
versioning for aliasing and then vectorization versioning for alignment.

Do we have any idea on how to track these versionings so we can
eventually commonize paths that didn't get any followup optimization
applied?

Also consider the case where loop distribution applies transforms
with versioning for aliasing to _both_ loop nests from your versioning.

Maybe we can at least add some loop->number_of_times_versioned
counter and some dumping/statistics to get an idea how big an
issue this might be?
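
To make the potential blow-up concrete, here is a rough C sketch
(hand-written for illustration, not compiler output; scaled_add and
its runtime checks are invented) of what a single loop could look like
after stride versioning followed by alias versioning of both of the
resulting nests:

```c
/* Original loop: dst[i] += src[i * stride].  If stride versioning
   runs first and loop distribution then versions each copy for
   aliasing, the one body ends up in four specialized copies.  */
void
scaled_add (float *dst, const float *src, int stride, int n)
{
  if (stride == 1)
    {
      if (dst + n <= src || src + n <= dst)
	/* Stride 1, no overlap: the copy the vectorizer wants.  */
	for (int i = 0; i < n; ++i)
	  dst[i] += src[i];
      else
	/* Stride 1, possible overlap.  */
	for (int i = 0; i < n; ++i)
	  dst[i] += src[i];
    }
  else
    {
      if (dst + n <= src || src + (long) n * stride <= dst)
	/* General stride, no overlap.  */
	for (int i = 0; i < n; ++i)
	  dst[i] += src[i * stride];
      else
	/* General stride, possible overlap: the only copy that would
	   have existed with no versioning at all.  */
	for (int i = 0; i < n; ++i)
	  dst[i] += src[i * stride];
    }
}
```

Each of the versionings above (stride, aliasing, and potentially
alignment on top) can double the number of copies, so k stacked
versionings can leave up to 2^k loop bodies, most of which may never
get any follow-up optimization.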

Thanks,
Richard.

> Richard
>
>
> 2018-10-24  Richard Sandiford  
>
> gcc/
> * doc/invoke.texi (-fversion-loops-for-strides): Document.
> (loop-versioning-group-size, loop-versioning-max-inner-insns)
> (loop-versioning-max-outer-insns): Document new --params.
> * Makefile.in (OBJS): Add gimple-loop-versioning.o.
> * common.opt (fversion-loops-for-strides): New option.
> * opts.c (default_options_table): Enable fversion-loops-for-strides
> at -O3.
> * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
> (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
> (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
> * passes.def: Add pass_loop_versioning.
> * timevar.def (TV_LOOP_VERSIONING): New time variable.
> * tree-ssa-propagate.h
> (substitute_and_fold_engine::substitute_and_fold): Add an optional
> block parameter.
> * tree-ssa-propagate.c
> (substitute_and_fold_engine::substitute_and_fold): Likewise.
> When passed, only walk blocks dominated by that block.
> * tree-vrp.h (range_includes_p): Declare.
> (range_includes_zero_p): Turn into an inline wrapper around
> range_includes_p.
> * tree-vrp.c (range_includes_p): New function, generalizing...
> (range_includes_zero_p): ...this.
> * tree-pass.h (make_pass_loop_versioning): Declare.
> * gimple-loop-versioning.cc: New file.
>
> gcc/testsuite/
> * gcc.dg/loop-versioning-1.c: New test.
> * gcc.dg/loop-versioning-10.c: Likewise.
> * gcc.dg/loop-versioning-11.c: Likewise.
> * gcc.dg/loop-versioning-2.c: Likewise.
> * gcc.dg/loop-versioning-3.c: Likewise.
> * gcc.dg/loop-versioning-4.c: Likewise.
> * gcc.dg/loop-versioning-5.c: Likewise.
> * gcc.dg/loop-versioning-6.c: Likewise.
> * gcc.dg/loop-versioning-7.c: Likewise.
> * gcc.dg/loop-versioning-8.c: Likewise.
> *