Re: Add a loop versioning pass
On December 14, 2018 5:18:38 PM GMT+01:00, Richard Sandiford wrote:
>Richard Biener writes:
>> On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford wrote:
>>>Richard Biener writes:
>>>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford wrote:
>>>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>>>> Also repeated the performance testing (but haven't yet tried an
>>>>> LTO variant; will do that over the weekend).
>>>>
>>>> Any results?
>>>
>>>Sorry, I should've remembered that finding time to run tests is easy,
>>>finding time to analyse them is hard.
>>>
>>>Speed-wise, the impact of the patch for LTO is similar to without,
>>>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>>>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>>>-flto=jobserver.
>>>
>>>Size-wise, there are three tests that grow by >=2% on x86_64:
>>>
>>>549.fotonik3d_r: 5.5%
>>>548.exchange2_r: 29.5%
>>>554.roms_r: 39.6%
>>
>> Uh.  With LTO we might have a reasonable guessed profile and you do
>> have an optimize_loop_nest_for_speed guard on the transform?
>
>Guard now added :-)  But unfortunately it doesn't make any significant
>difference.  548.exchange2_r goes from 29.5% to 27.7%, but the other
>two are the same as before.
>
>> How does compile time fare with the above benchmarks?
>
>For 554.roms_r it's +80%(!) with -flto=1, but I think that's par for
>the course given the increase in function sizes.

:(

>For 549.fotonik3d_r it's +5% with -flto=1.
>
>For 503.bwaves_r (as an example of a benchmark whose size doesn't
>change), the difference is in the noise.
>
>[...]
>
>>>You mean something like:
>>>
>>>  real :: foo(:,:), bar(:)
>>>
>>>  do i=1,n
>>>    do j=1,n
>>>      foo(i,j) = ...
>>>    end do
>>>    bar(i) = ..
>>>  end do
>>>
>>>? I can add a test if so.
>>
>> Please.
>
>OK, I've added them to loop_versioning_4.f90.
>
>>>> There may also be some subtle issues with substitute_and_fold being
>>>> applied to non-up-to-date SSA form given it folds stmts looking at
>>>> (single-use!) SSA edges.  The single-use guard might be what saves
>>>> you here (SSA uses in the copies are not yet updated to point to the
>>>> copied DEFs).
>>>
>>>OK.  I was hoping that because we only apply substitute_and_fold
>>>to new code, there would be no problem with uses elsewhere.
>>
>> Might be, yes.
>>
>>>Would it be safer to:
>>>
>>>  - version all loops we want to version
>>>  - update SSA explicitly
>>>  - apply substitute and fold to all "new" loops
>>
>> That would be definitely less fishy.  But you need to get at the actual
>> 'new' SSA names for the replacements as I guess they'll be rewritten?
>> Or maybe those are not.
>>
>>>? Could we then get away with returning a 0 TODO at the end?
>>
>> Yes.
>
>OK, the updated patch does it this way.
>
>Tested as before.

OK.

Thanks,
Richard.

>Thanks,
>Richard
>
>
>2018-12-14  Richard Sandiford
>	    Ramana Radhakrishnan
>	    Kyrylo Tkachov
>
>gcc/
>	* doc/invoke.texi (-fversion-loops-for-strides): Document
>	(loop-versioning-group-size, loop-versioning-max-inner-insns)
>	(loop-versioning-max-outer-insns): Document new --params.
>	* Makefile.in (OBJS): Add gimple-loop-versioning.o.
>	* common.opt (fversion-loops-for-strides): New option.
>	* opts.c (default_options_table): Enable fversion-loops-for-strides
>	at -O3.
>	* params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
>	(PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
>	(PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
>	* passes.def: Add pass_loop_versioning.
>	* timevar.def (TV_LOOP_VERSIONING): New time variable.
>	* tree-ssa-propagate.h
>	(substitute_and_fold_engine::substitute_and_fold): Add an optional
>	block parameter.
>	* tree-ssa-propagate.c
>	(substitute_and_fold_engine::substitute_and_fold): Likewise.
>	When passed, only walk blocks dominated by that block.
>	* tree-vrp.h (range_includes_p): Declare.
>	(range_includes_zero_p): Turn into an inline wrapper around
>	range_includes_p.
>	* tree-vrp.c (range_includes_p): New function, generalizing...
>	(range_includes_zero_p): ...this.
>	* tree-pass.h (make_pass_loop_versioning): Declare.
>	* gimple-loop-versioning.cc: New file.
>
>gcc/testsuite/
>	* gcc.dg/loop-versioning-1.c: New test.
>	* gcc.dg/loop-versioning-10.c: Likewise.
>	* gcc.dg/loop-versioning-11.c: Likewise.
>	* gcc.dg/loop-versioning-2.c: Likewise.
>	* gcc.dg/loop-versioning-3.c: Likewise.
>	* gcc.dg/loop-versioning-4.c: Likewise.
>	* gcc.dg/loop-versioning-5.c: Likewise.
>	* gcc.dg/loop-versioning-6.c: Likewise.
>	* gcc.dg/loop-versioning-7.c: Likewise.
>	* gcc.dg/loop-versioning-8.c: Likewise.
>	* gcc.dg/loop-versioning-9.c: Likewise.
>	* gfortran.dg/loop_versioning_1.f90: Likewise.
>	* gfortran.dg/loop_versioning_2.f90: Likewise.
Re: Add a loop versioning pass
Richard Biener writes:
> On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford wrote:
>>Richard Biener writes:
>>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford wrote:
>>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>>> Also repeated the performance testing (but haven't yet tried an
>>>> LTO variant; will do that over the weekend).
>>>
>>> Any results?
>>
>>Sorry, I should've remembered that finding time to run tests is easy,
>>finding time to analyse them is hard.
>>
>>Speed-wise, the impact of the patch for LTO is similar to without,
>>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>>-flto=jobserver.
>>
>>Size-wise, there are three tests that grow by >=2% on x86_64:
>>
>>549.fotonik3d_r: 5.5%
>>548.exchange2_r: 29.5%
>>554.roms_r: 39.6%
>
> Uh.  With LTO we might have a reasonable guessed profile and you do
> have an optimize_loop_nest_for_speed guard on the transform?

Guard now added :-)  But unfortunately it doesn't make any significant
difference.  548.exchange2_r goes from 29.5% to 27.7%, but the other
two are the same as before.

> How does compile time fare with the above benchmarks?

For 554.roms_r it's +80%(!) with -flto=1, but I think that's par for
the course given the increase in function sizes.

For 549.fotonik3d_r it's +5% with -flto=1.

For 503.bwaves_r (as an example of a benchmark whose size doesn't
change), the difference is in the noise.

[...]

>>You mean something like:
>>
>>  real :: foo(:,:), bar(:)
>>
>>  do i=1,n
>>    do j=1,n
>>      foo(i,j) = ...
>>    end do
>>    bar(i) = ..
>>  end do
>>
>>? I can add a test if so.
>
> Please.

OK, I've added them to loop_versioning_4.f90.

>>> There may also be some subtle issues with substitute_and_fold being
>>> applied to non-up-to-date SSA form given it folds stmts looking at
>>> (single-use!) SSA edges.  The single-use guard might be what saves
>>> you here (SSA uses in the copies are not yet updated to point to the
>>> copied DEFs).
>>
>>OK.  I was hoping that because we only apply substitute_and_fold
>>to new code, there would be no problem with uses elsewhere.
>
> Might be, yes.
>
>>Would it be safer to:
>>
>>  - version all loops we want to version
>>  - update SSA explicitly
>>  - apply substitute and fold to all "new" loops
>
> That would be definitely less fishy.  But you need to get at the actual
> 'new' SSA names for the replacements as I guess they'll be rewritten?
> Or maybe those are not.
>
>>? Could we then get away with returning a 0 TODO at the end?
>
> Yes.

OK, the updated patch does it this way.

Tested as before.

Thanks,
Richard


2018-12-14  Richard Sandiford
	    Ramana Radhakrishnan
	    Kyrylo Tkachov

gcc/
	* doc/invoke.texi (-fversion-loops-for-strides): Document
	(loop-versioning-group-size, loop-versioning-max-inner-insns)
	(loop-versioning-max-outer-insns): Document new --params.
	* Makefile.in (OBJS): Add gimple-loop-versioning.o.
	* common.opt (fversion-loops-for-strides): New option.
	* opts.c (default_options_table): Enable fversion-loops-for-strides
	at -O3.
	* params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
	(PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
	(PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
	* passes.def: Add pass_loop_versioning.
	* timevar.def (TV_LOOP_VERSIONING): New time variable.
	* tree-ssa-propagate.h
	(substitute_and_fold_engine::substitute_and_fold): Add an optional
	block parameter.
	* tree-ssa-propagate.c
	(substitute_and_fold_engine::substitute_and_fold): Likewise.
	When passed, only walk blocks dominated by that block.
	* tree-vrp.h (range_includes_p): Declare.
	(range_includes_zero_p): Turn into an inline wrapper around
	range_includes_p.
	* tree-vrp.c (range_includes_p): New function, generalizing...
	(range_includes_zero_p): ...this.
	* tree-pass.h (make_pass_loop_versioning): Declare.
	* gimple-loop-versioning.cc: New file.

gcc/testsuite/
	* gcc.dg/loop-versioning-1.c: New test.
	* gcc.dg/loop-versioning-10.c: Likewise.
	* gcc.dg/loop-versioning-11.c: Likewise.
	* gcc.dg/loop-versioning-2.c: Likewise.
	* gcc.dg/loop-versioning-3.c: Likewise.
	* gcc.dg/loop-versioning-4.c: Likewise.
	* gcc.dg/loop-versioning-5.c: Likewise.
	* gcc.dg/loop-versioning-6.c: Likewise.
	* gcc.dg/loop-versioning-7.c: Likewise.
	* gcc.dg/loop-versioning-8.c: Likewise.
	* gcc.dg/loop-versioning-9.c: Likewise.
	* gfortran.dg/loop_versioning_1.f90: Likewise.
	* gfortran.dg/loop_versioning_2.f90: Likewise.
	* gfortran.dg/loop_versioning_3.f90: Likewise.
	* gfortran.dg/loop_versioning_4.f90: Likewise.
	* gfortran.dg/loop_versioning_5.f90: Likewise.
Re: Add a loop versioning pass
On December 12, 2018 7:43:10 PM GMT+01:00, Richard Sandiford wrote:
>Richard Biener writes:
>> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford wrote:
>>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>>> Also repeated the performance testing (but haven't yet tried an
>>> LTO variant; will do that over the weekend).
>>
>> Any results?
>
>Sorry, I should've remembered that finding time to run tests is easy,
>finding time to analyse them is hard.
>
>Speed-wise, the impact of the patch for LTO is similar to without,
>with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
>I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
>-flto=jobserver.
>
>Size-wise, there are three tests that grow by >=2% on x86_64:
>
>549.fotonik3d_r: 5.5%
>548.exchange2_r: 29.5%
>554.roms_r: 39.6%

Uh.  With LTO we might have a reasonable guessed profile and you do
have an optimize_loop_nest_for_speed guard on the transform?

How does compile time fare with the above benchmarks?

>The roms increase seems fair enough given the benefit, since the
>whole program uses a similar style for handling arrays.
>
>fotonik3d is a mixed bag.  Some versioning conditions are from
>things like:
>
>  subroutine foo(a)
>    real :: a(:,:,:)
>    a(1,:,:) = ...
>  end subroutine
>
>where we version for the middle dimension having a stride of 1.
>This could be eliminated if we had more semantic information.
>
>Other conditions are for things like:
>
>  subroutine foo(a)
>    real :: a(:,:,:)
>    a(:,1,:) = ...
>  end subroutine
>
>where the pass really does help, but unfortunately those loops
>aren't hot and might not even be used at all.
>
>Some opportunities are cases in which we avoid gather loads, such as
>in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
>But mp feeds a gather load itself and these assignments aren't a
>critical part of the loop.
>
>(I'm testing on an AVX-only system, so these are synthesised gathers
>rather than real gathers.)
>
>The growth in 548.exchange2_r comes from reasonable changes to cold
>code.  The test spends almost all its time in __brute_force_MOD_digits_2,
>which contains the same code before and after the patch, but which
>accounts for less than 1% of .text before the patch.
>
>> I've skimmed over the updated patch and it looks
>> a lot better now.
>>
>> +bool
>> +loop_versioning
>> +::find_per_loop_multiplication (address_info &address,
>> +				  address_term_info &term)
>> +{
>>
>> is that what coding convention allows?
>
>Not sure we've settled on something, so I improvised.
>
>> For grepping I'd then say we should do
>>
>> bool loop_versioning::
>> find_per_loop_multiplication (...)
>>
>> ;)
>
>Sounds good to me.
>
>> Anywhere else you use
>>
>> loop_versioning::foo
>>
>> so please stick to that.
>
>Yeah, I used that when I could avoid an argument spilling over 80 chars,
>but I think I missed that the above function now fits into that category.
>Will double-check the others.
>
>> Otherwise OK.
>
>Thanks :-)
>
>> I think I don't see a testcase where we could version both loops in a nest
>> so I'm not sure whether the transform works fine when you are only
>> updating SSA form at the very end of the pass.
>
>You mean something like:
>
>  real :: foo(:,:), bar(:)
>
>  do i=1,n
>    do j=1,n
>      foo(i,j) = ...
>    end do
>    bar(i) = ..
>  end do
>
>? I can add a test if so.

Please.

>At the moment the pass only looks for direct versioning opportunities
>in inner loops, so the assignment to bar wouldn't be treated as a
>versioning opportunity.

Ah, OK.

>We should still hoist the version check
>for foo outside both loops, which is tested by loop_versioning_4.f90
>for cases in which the outer loop doesn't have its own array
>arithmetic, but isn't tested for cases like the above.
>
>> There may also be some subtle issues with substitute_and_fold being
>> applied to non-up-to-date SSA form given it folds stmts looking at
>> (single-use!) SSA edges.  The single-use guard might be what saves you
>> here (SSA uses in the copies are not yet updated to point to the
>> copied DEFs).
>
>OK.  I was hoping that because we only apply substitute_and_fold
>to new code, there would be no problem with uses elsewhere.

Might be, yes.

>Would it be safer to:
>
>  - version all loops we want to version
>  - update SSA explicitly
>  - apply substitute and fold to all "new" loops

That would be definitely less fishy.  But you need to get at the actual
'new' SSA names for the replacements as I guess they'll be rewritten?
Or maybe those are not.

>? Could we then get away with returning a 0 TODO at the end?

Yes.

Richard.

>Thanks,
>Richard
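The hoisting behaviour discussed above — a single version check for foo placed outside both loops rather than per outer iteration — can be illustrated with a hand-written C sketch. The flattened index arithmetic and all names here are invented for illustration and only loosely model the Fortran example; the pass itself emits the equivalent GIMPLE.

```c
/* Hand-written illustration (not pass output): the stride check for
   "foo" is evaluated once, outside both loops, rather than once per
   outer iteration.  */
static void
nest_versioned (int *foo, int *bar, int n, int stride)
{
  if (stride == 1)
    /* Fast path: the inner loop accesses foo contiguously.  */
    for (int i = 0; i < n; ++i)
      {
	for (int j = 0; j < n; ++j)
	  foo[i * n + j] = i + j;
	bar[i] = i;
      }
  else
    /* General path: variable inner stride, kept as-is.  */
    for (int i = 0; i < n; ++i)
      {
	for (int j = 0; j < n; ++j)
	  foo[(i * n + j) * stride] = i + j;
	bar[i] = i;
      }
}
```

Note that the assignment to bar is identical on both paths; only foo's accesses benefit from the check, which is why the check can be shared by the whole nest.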
Re: Add a loop versioning pass
Richard Biener writes:
> On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford wrote:
>> Tested on x86_64-linux-gnu, aarch64-linux-gnu and aarch64_be-elf.
>> Also repeated the performance testing (but haven't yet tried an
>> LTO variant; will do that over the weekend).
>
> Any results?

Sorry, I should've remembered that finding time to run tests is easy,
finding time to analyse them is hard.

Speed-wise, the impact of the patch for LTO is similar to without,
with 554.roms_r being the main beneficiary for both AArch64 and x86_64.
I get a 6.8% improvement on Cortex-A57 with -Ofast -mcpu=native
-flto=jobserver.

Size-wise, there are three tests that grow by >=2% on x86_64:

549.fotonik3d_r: 5.5%
548.exchange2_r: 29.5%
554.roms_r: 39.6%

The roms increase seems fair enough given the benefit, since the
whole program uses a similar style for handling arrays.

fotonik3d is a mixed bag.  Some versioning conditions are from
things like:

  subroutine foo(a)
    real :: a(:,:,:)
    a(1,:,:) = ...
  end subroutine

where we version for the middle dimension having a stride of 1.
This could be eliminated if we had more semantic information.

Other conditions are for things like:

  subroutine foo(a)
    real :: a(:,:,:)
    a(:,1,:) = ...
  end subroutine

where the pass really does help, but unfortunately those loops
aren't hot and might not even be used at all.

Some opportunities are cases in which we avoid gather loads, such as
in the "mp" loads in the hot __material_mod_MOD_mat_updatee function.
But mp feeds a gather load itself and these assignments aren't a
critical part of the loop.

(I'm testing on an AVX-only system, so these are synthesised gathers
rather than real gathers.)

The growth in 548.exchange2_r comes from reasonable changes to cold
code.  The test spends almost all its time in __brute_force_MOD_digits_2,
which contains the same code before and after the patch, but which
accounts for less than 1% of .text before the patch.

> I've skimmed over the updated patch and it looks
> a lot better now.
>
> +bool
> +loop_versioning
> +::find_per_loop_multiplication (address_info &address,
> +				 address_term_info &term)
> +{
>
> is that what coding convention allows?

Not sure we've settled on something, so I improvised.

> For grepping I'd then say we should do
>
> bool loop_versioning::
> find_per_loop_multiplication (...)
>
> ;)

Sounds good to me.

> Anywhere else you use
>
> loop_versioning::foo
>
> so please stick to that.

Yeah, I used that when I could avoid an argument spilling over 80 chars,
but I think I missed that the above function now fits into that category.
Will double-check the others.

> Otherwise OK.

Thanks :-)

> I think I don't see a testcase where we could version both loops in a nest
> so I'm not sure whether the transform works fine when you are only
> updating SSA form at the very end of the pass.

You mean something like:

  real :: foo(:,:), bar(:)

  do i=1,n
    do j=1,n
      foo(i,j) = ...
    end do
    bar(i) = ..
  end do

? I can add a test if so.

At the moment the pass only looks for direct versioning opportunities
in inner loops, so the assignment to bar wouldn't be treated as a
versioning opportunity.  We should still hoist the version check
for foo outside both loops, which is tested by loop_versioning_4.f90
for cases in which the outer loop doesn't have its own array
arithmetic, but isn't tested for cases like the above.

> There may also be some subtle issues with substitute_and_fold being
> applied to non-up-to-date SSA form given it folds stmts looking at
> (single-use!) SSA edges.  The single-use guard might be what saves you
> here (SSA uses in the copies are not yet updated to point to the
> copied DEFs).

OK.  I was hoping that because we only apply substitute_and_fold
to new code, there would be no problem with uses elsewhere.

Would it be safer to:

  - version all loops we want to version
  - update SSA explicitly
  - apply substitute and fold to all "new" loops

? Could we then get away with returning a 0 TODO at the end?

Thanks,
Richard
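The three-phase ordering proposed above — version every candidate loop first, do one explicit SSA update, then run substitute-and-fold only inside the new copies — can be caricatured with a toy model. Nothing below is real GCC API: "statements" are just arrays of name tokens, and the renaming stands in for the SSA update that gives the copies their own definitions.

```c
#include <string.h>

#define BODY_LEN 3

/* Phase 1: "version" a loop by copying its body; uses in the copy
   still refer to the original names, mirroring the not-yet-updated
   SSA form.  */
static void
copy_body (const char *src[BODY_LEN], const char *dst[BODY_LEN])
{
  for (int i = 0; i < BODY_LEN; ++i)
    dst[i] = src[i];
}

/* Phase 2: one explicit renaming pass over the new code only,
   standing in for the single SSA update.  */
static void
rename_uses (const char *body[BODY_LEN], const char *old_name,
	     const char *new_name)
{
  for (int i = 0; i < BODY_LEN; ++i)
    if (strcmp (body[i], old_name) == 0)
      body[i] = new_name;
}

/* Phase 3: substitute-and-fold confined to the copy, where the
   version guard guarantees the stride is 1.  */
static void
fold_stride_to_one (const char *body[BODY_LEN], const char *name)
{
  for (int i = 0; i < BODY_LEN; ++i)
    if (strcmp (body[i], name) == 0)
      body[i] = "1";
}
```

The point of the ordering is visible in the model: folding touches only names introduced in phase 2, so the original body is never rewritten through stale uses.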
Re: Add a loop versioning pass
On Thu, Dec 6, 2018 at 2:19 PM Richard Sandiford wrote:
>
> Richard Biener writes:
>>>> The pass contains an awful lot of heuristics :/  Like last year
>>>> with the interchange pass I would suggest to rip most of it out
>>>> and first lay infrastructure with the cases you can positively
>>>> identify without applying heuristics or "hacks" like stripping
>>>> semantically required casts.  That makes it also clear which
>>>> testcases test which code-path.  That said, all the analyze
>>>> multiplications/plusses/factors stuff was extremely hard to review
>>>> and I have no overall picture why this is all so complicated or
>>>> necessary.
>>>
>>> I think the tests do cover most of this -- was glad to see that
>>> when Kyrill tried taking something out, one of the tests started
>>> to fail :-).  And the comments in the tests as well as the code
>>> are supposed to explain why this is desirable.  But I can try to
>>> spruce up the comments in the code if they're not detailed enough.
>>>
>>> The problem is that the motivating (Fortran) case can't be identified
>>> without applying heuristics.  All we see is a collection of adds and
>>> multiplies, with no semantic information to say which multiplies are for
>>> the inner dimension and which aren't.  I agree it would be nicer not
>>> to have them, which is why I'd originally suggested the IFN to provide
>>> information about dimensions.
>>
>> Sure, still a new pass having _all_ the heuristic built in with
>> 10s of testcases do not make it easy to review those heuristics...
>
> Fair :-)  In the patch below I've tried to cut down on the special cases
> and tried to add more comments explaining which patterns the code is
> trying to detect/reject.  Probably the key part is the one beginning:
>
> +   The main difficulty here isn't finding strides that could be used
> +   in a version check (that's pretty easy).  The problem instead is to
> +   avoid versioning for some stride S that is unlikely ever to be 1 at
> +   runtime.  Versioning for S == 1 on its own would lead to unnecessary
> +   code bloat, while adding S == 1 to more realistic version conditions
> +   would lose the optimisation opportunity offered by those other
> +   conditions.
>
>>>>> +/* Return true if in principle it is worth versioning an index fragment of
>>>>> +   the form:
>>>>> +
>>>>> +     (i * b * SCALE) / FACTOR
>>>>> +
>>>>> +   for the case in which b == 1.  */
>>>>> +
>>>>> +bool
>>>>> +loop_versioning::acceptable_scale_p (tree scale, poly_uint64 factor)
>>>>> +{
>>>>> +  /* See whether SCALE is a constant multiple of FACTOR, and if the
>>>>> +     multiple is small enough for us to treat it as a potential grouped
>>>>> +     access.  For example:
>>>>> +
>>>>> +       for (auto i : ...)
>>>>> +	 y[i] = f (x[4 * i * stride],
>>>>> +		   x[4 * i * stride + 1],
>>>>> +		   x[4 * i * stride + 2]);
>>>>> +
>>>>> +     would benefit from versioning for the case in which stride == 1.
>>>>> +     High multiples of i * stride are less likely to benefit, and could
>>>>> +     indicate a simulated multi-dimensional array.
>>>>> +
>>>>> +     This is just a heuristic, to avoid having to do expensive group
>>>>> +     analysis of the data references in a loop.  */
>>>>> +  poly_uint64 const_scale;
>>>>> +  unsigned int multiple;
>>>>> +  if (poly_int_tree_p (scale, &const_scale)
>>>>> +      && constant_multiple_p (const_scale, factor, &multiple))
>>>>> +    {
>>>>> +      unsigned int maxval = PARAM_VALUE (PARAM_LOOP_VERSIONING_GROUP_SIZE);
>>>>> +      return IN_RANGE (multiple, 1, maxval);
>>>>
>>>> Hmm.  So you _do_ want to version sth like
>>>>
>>>>   struct X { int i; int j; } a[2048];
>>>>
>>>>   for (int i = start; i < end; ++i)
>>>>     a[i*s].i = 1;
>>>>
>>>> ?  That is, even with s == 1 the accesses will not become contiguous?
>>>> OK, I suppose that's because you are looking at a stmt in isolation
>>>> and another stmt may access a[i*s].j here.
>>>>
>>>> That is, would it be a future improvement to run sth like the
>>>> vectorizers group access analysis on the references and perform
>>>> this check on whole groups then, possibly better being able to
>>>> constrain what is now the magic parameter
>>>> PARAM_LOOP_VERSIONING_GROUP_SIZE?
>>>
>>> Yeah, possibly.  The problem is that we might end up having to reproduce
>>> vectoriser heuristics.  E.g. stores with gaps should be vectorisable
>>> in future with SVE, using contiguous predicated stores in which some
>>> lanes are inactive.  So we don't necessarily need the .j access for
>>> this to be worthwhile.
>>>
>>> But perhaps the argument for versioning for vectorisation is stronger
>>> for grouped accesses than contiguous ones.
>>>
>>> Note that the above example doesn't rely on the grouping heuristic
>>> since we just strip the COMPONENT_REF and then analyze the ARRAY_REF
>>> with a
Re: Add a loop versioning pass
Richard Biener writes:
>>> The pass contains an awful lot of heuristics :/  Like last year
>>> with the interchange pass I would suggest to rip most of it out
>>> and first lay infrastructure with the cases you can positively
>>> identify without applying heuristics or "hacks" like stripping
>>> semantically required casts.  That makes it also clear which
>>> testcases test which code-path.  That said, all the analyze
>>> multiplications/plusses/factors stuff was extremely hard to review
>>> and I have no overall picture why this is all so complicated or
>>> necessary.
>>
>> I think the tests do cover most of this -- was glad to see that
>> when Kyrill tried taking something out, one of the tests started
>> to fail :-).  And the comments in the tests as well as the code
>> are supposed to explain why this is desirable.  But I can try to
>> spruce up the comments in the code if they're not detailed enough.
>>
>> The problem is that the motivating (Fortran) case can't be identified
>> without applying heuristics.  All we see is a collection of adds and
>> multiplies, with no semantic information to say which multiplies are for
>> the inner dimension and which aren't.  I agree it would be nicer not
>> to have them, which is why I'd originally suggested the IFN to provide
>> information about dimensions.
>
> Sure, still a new pass having _all_ the heuristic built in with
> 10s of testcases do not make it easy to review those heuristics...

Fair :-)  In the patch below I've tried to cut down on the special cases
and tried to add more comments explaining which patterns the code is
trying to detect/reject.  Probably the key part is the one beginning:

+   The main difficulty here isn't finding strides that could be used
+   in a version check (that's pretty easy).  The problem instead is to
+   avoid versioning for some stride S that is unlikely ever to be 1 at
+   runtime.  Versioning for S == 1 on its own would lead to unnecessary
+   code bloat, while adding S == 1 to more realistic version conditions
+   would lose the optimisation opportunity offered by those other
+   conditions.

>>>> +/* Return true if in principle it is worth versioning an index fragment of
>>>> +   the form:
>>>> +
>>>> +     (i * b * SCALE) / FACTOR
>>>> +
>>>> +   for the case in which b == 1.  */
>>>> +
>>>> +bool
>>>> +loop_versioning::acceptable_scale_p (tree scale, poly_uint64 factor)
>>>> +{
>>>> +  /* See whether SCALE is a constant multiple of FACTOR, and if the
>>>> +     multiple is small enough for us to treat it as a potential grouped
>>>> +     access.  For example:
>>>> +
>>>> +       for (auto i : ...)
>>>> +	 y[i] = f (x[4 * i * stride],
>>>> +		   x[4 * i * stride + 1],
>>>> +		   x[4 * i * stride + 2]);
>>>> +
>>>> +     would benefit from versioning for the case in which stride == 1.
>>>> +     High multiples of i * stride are less likely to benefit, and could
>>>> +     indicate a simulated multi-dimensional array.
>>>> +
>>>> +     This is just a heuristic, to avoid having to do expensive group
>>>> +     analysis of the data references in a loop.  */
>>>> +  poly_uint64 const_scale;
>>>> +  unsigned int multiple;
>>>> +  if (poly_int_tree_p (scale, &const_scale)
>>>> +      && constant_multiple_p (const_scale, factor, &multiple))
>>>> +    {
>>>> +      unsigned int maxval = PARAM_VALUE (PARAM_LOOP_VERSIONING_GROUP_SIZE);
>>>> +      return IN_RANGE (multiple, 1, maxval);
>>>
>>> Hmm.  So you _do_ want to version sth like
>>>
>>>   struct X { int i; int j; } a[2048];
>>>
>>>   for (int i = start; i < end; ++i)
>>>     a[i*s].i = 1;
>>>
>>> ?  That is, even with s == 1 the accesses will not become contiguous?
>>> OK, I suppose that's because you are looking at a stmt in isolation
>>> and another stmt may access a[i*s].j here.
>>>
>>> That is, would it be a future improvement to run sth like the
>>> vectorizers group access analysis on the references and perform
>>> this check on whole groups then, possibly better being able to
>>> constrain what is now the magic parameter PARAM_LOOP_VERSIONING_GROUP_SIZE?
>>
>> Yeah, possibly.  The problem is that we might end up having to reproduce
>> vectoriser heuristics.  E.g. stores with gaps should be vectorisable
>> in future with SVE, using contiguous predicated stores in which some
>> lanes are inactive.  So we don't necessarily need the .j access for
>> this to be worthwhile.
>>
>> But perhaps the argument for versioning for vectorisation is stronger
>> for grouped accesses than contiguous ones.
>>
>> Note that the above example doesn't rely on the grouping heuristic
>> since we just strip the COMPONENT_REF and then analyze the ARRAY_REF
>> with a factor of 1.  The heuristic instead handles things like:
>>
>>   a[i * stride * 2]
>>
>> etc., and the param is more there to decide when a constant factor is
>> small enough to be a realistic group size instead of an outer dimension.
>> E.g. if we see:
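For plain integer scales, the quoted acceptable_scale_p heuristic boils down to roughly the following C sketch. The real code works on poly_uint64 trees and reads the limit from a --param; GROUP_SIZE, its value, and the function name here are stand-ins of mine.

```c
#include <stdbool.h>

/* Stand-in for --param loop-versioning-group-size (invented value).  */
#define GROUP_SIZE 4

/* Accept SCALE only when it is a constant multiple of FACTOR and the
   multiple is small enough to look like a grouped access rather than
   an outer array dimension.  */
static bool
acceptable_scale_sketch (unsigned int scale, unsigned int factor)
{
  if (factor == 0 || scale % factor != 0)
    return false;
  unsigned int multiple = scale / factor;
  return multiple >= 1 && multiple <= GROUP_SIZE;
}
```

So x[4 * i * stride] with element factor 4 gives a multiple of 4 and is accepted, while a multiple of, say, 8 would be rejected as a likely simulated outer dimension.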
Re: Add a loop versioning pass
On Wed, 28 Nov 2018, Richard Sandiford wrote:

> Thanks for the review and sorry for the long time replying.

Likewise ...

> Richard Biener writes:
>>> This patch adds a pass that versions loops with variable index strides
>>> for the case in which the stride is 1.  E.g.:
>>>
>>>     for (int i = 0; i < n; ++i)
>>>       x[i * stride] = ...;
>>>
>>> becomes:
>>>
>>>     if (stepx == 1)
>>>       for (int i = 0; i < n; ++i)
>>>         x[i] = ...;
>>>     else
>>>       for (int i = 0; i < n; ++i)
>>>         x[i * stride] = ...;
>>>
>>> This is useful for both vector code and scalar code, and in some cases
>>> can enable further optimisations like loop interchange or pattern
>>> recognition.
>>>
>>> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
>>> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
>>> that regress.
>>>
>>> Sizewise, there's a 10% increase in .text for both 554.roms_r and
>>> 465.tonto.  That's obviously a lot, but in tonto's case it's because
>>> the whole program is written using assumed-shape arrays and pointers,
>>> so a large number of functions really do benefit from versioning.
>>> roms likewise makes heavy use of assumed-shape arrays, and that
>>> improvement in performance IMO justifies the code growth.
>>
>> Ouch.  I know that at least with LTO IPA-CP can do "quite" some
>> propagation of constant strides.  Not sure if we're aggressive
>> enough in actually doing the cloning for all cases we figure out
>> strides though.  But my question is how we can avoid doing the
>> versioning for loops in the copy that did not have the IPA-CPed
>> stride of one?  Ideally we'd be able to mark individual references
>> as {definitely,likely,unlikely,not}-unit-stride?
>
> This is a more elaborate version of what I was trying to do with
> the IFN I'd mentioned on IRC a few weeks back.  It would be a bit
> like a long-living ASSERT_EXPR that provides hints about the value
> of the SSA name.  Another alternative would be to annotate the
> SSA name itself, but then we'd easily lose the information during
> optimisation.
>
> But will IPA-CP conditionally use a clone for a stride of 1 for
> calls that pass a variable stride that might or might not be 1?
> E.g. if it clones foo as foo.1 for calls in which a stride parameter
> is 1 at compile time, does it also make foo call foo.1 when the stride
> parameter is 1 at runtime?  If not, that sounds like a missed optimisation.
> If so, prune_conditions should stop us from versioning.

Martin answered that.

>>> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
>>> a small (0.4%) speed improvement there, but although both 3-iteration runs
>>> produced stable results, that might still be noise.  There was a slightly
>>> larger (non-noise) improvement for a 256-bit SVE model.
>>>
>>> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
>>> without any noticeable improvement in performance.  No other test grew
>>> by more than 2%.
>>>
>>> Although the main SPEC beneficiaries are all Fortran tests, the
>>> benchmarks we use for SVE also include some C and C++ tests that
>>> benefit.
>>
>> Did you see any slowdown, for example because versioning was forced
>> to be on an innermost loop?  I'm thinking of the testcase in
>> PR87561 where we do have strided accesses in the innermost loop.
>
> Not so far.
>
>> Since you cite performance numbers how did you measure them?
>> I assume -Ofast -march=native but did you check with -flto?
>
> -O3 -march=native.  No, didn't check LTO, will do that.
>
>>> Using -frepack-arrays gives the same benefits in many Fortran cases.
>>> The problem is that using that option inappropriately can force a full
>>> array copy for arguments that the function only reads once, and so it
>>> isn't really something we can turn on by default.  The new pass is
>>> supposed to give most of the benefits of -frepack-arrays without
>>> the risk of unnecessary repacking.
>>>
>>> The patch therefore enables the pass by default at -O3.
>>
>> I think that's reasonable.
>>
>> One possible enhancement would be to add a value-profile for the
>> strides so we can guide this optimization better.
>>
>> The pass falls foul of the C++-class style of making small methods
>> of everything.  That makes following the code very hard.  Please
>> inline single-use methods in callers wherever possible to make the
>> code read more like GCC code (using GCC API).
>
> I'll bite my tongue on this one :-)
>
>> The pass contains an awful lot of heuristics :/  Like last year
>> with the interchange pass I would suggest to rip most of it out
>> and first lay infrastructure with the cases you can positively
>> identify without applying heuristics or "hacks" like stripping
>> semantically required casts.  That makes it also clear which
>> testcases test which code-path.  That said, all the analyze
Re: Add a loop versioning pass
Hi, On Wed, Nov 28 2018, Richard Sandiford wrote: >>> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 >>> and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests >>> that regress. >>> >>> Sizewise, there's a 10% increase in .text for both 554.roms_r and >>> 465.tonto. That's obviously a lot, but in tonto's case it's because >>> the whole program is written using assumed-shape arrays and pointers, >>> so a large number of functions really do benefit from versioning. >>> roms likewise makes heavy use of assumed-shape arrays, and that >>> improvement in performance IMO justifies the code growth. >> >> Ouch. I know that at least with LTO IPA-CP can do "quite" some >> propagation of constant strides. Not sure if we're aggressive >> enough in actually doing the cloning for all cases we figure out >> strides though. But my question is how we can avoid doing the >> versioning for loops in the copy that did not have the IPA-CPed >> stride of one? Ideally we'd be able to mark individual references >> as {definitely,likely,unlikely,not}-unit-stride? > > This is a more elaborate version of what I was trying to do with > the IFN I'd mentioned on IRC a few weeks back. It would be a bit > like a long-living ASSERT_EXPR that provides hints about the value > of the SSA name. Another alternative would be to annotate the > SSA name itself, but then we'd easily lose the information during > optimisation. > > But will IPA-CP conditionally use a clone for a stride of 1 for > calls that pass a variable stride that might or might not be 1? > E.g. if it clones foo as foo.1 for calls in which a stride parameter > is 1 at compile time, does it also make foo call foo.1 when the stride > parameter is 1 at runtime? If not, that sounds like a missed optimisation. > If so, prune_conditions should stop us from versioning. IPA-CP only creates clones for compile-time constants, it does not add any run-time check to the callers. 
It is an interesting idea that IPA-CP could add such run-time checks in cases where it decides to do the cloning anyway. Martin
Re: Add a loop versioning pass
Thanks for the review and sorry for the long time replying. Richard Biener writes: >> This patch adds a pass that versions loops with variable index strides >> for the case in which the stride is 1. E.g.: >> >> for (int i = 0; i < n; ++i) >> x[i * stride] = ...; >> >> becomes: >> >> if (stepx == 1) >> for (int i = 0; i < n; ++i) >> x[i] = ...; >> else >> for (int i = 0; i < n; ++i) >> x[i * stride] = ...; >> >> This is useful for both vector code and scalar code, and in some cases >> can enable further optimisations like loop interchange or pattern >> recognition. >> >> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 >> and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests >> that regress. >> >> Sizewise, there's a 10% increase in .text for both 554.roms_r and >> 465.tonto. That's obviously a lot, but in tonto's case it's because >> the whole program is written using assumed-shape arrays and pointers, >> so a large number of functions really do benefit from versioning. >> roms likewise makes heavy use of assumed-shape arrays, and that >> improvement in performance IMO justifies the code growth. > > Ouch. I know that at least with LTO IPA-CP can do "quite" some > propagation of constant strides. Not sure if we're aggressive > enough in actually doing the cloning for all cases we figure out > strides though. But my question is how we can avoid doing the > versioning for loops in the copy that did not have the IPA-CPed > stride of one? Ideally we'd be able to mark individual references > as {definitely,likely,unlikely,not}-unit-stride? This is a more elaborate version of what I was trying to do with the IFN I'd mentioned on IRC a few weeks back. It would be a bit like a long-living ASSERT_EXPR that provides hints about the value of the SSA name. Another alternative would be to annotate the SSA name itself, but then we'd easily lose the information during optimisation. 
But will IPA-CP conditionally use a clone for a stride of 1 for calls that pass a variable stride that might or might not be 1? E.g. if it clones foo as foo.1 for calls in which a stride parameter is 1 at compile time, does it also make foo call foo.1 when the stride parameter is 1 at runtime? If not, that sounds like a missed optimisation. If so, prune_conditions should stop us from versioning. >> The next biggest .text increase is 4.5% for 548.exchange2_r. I did see >> a small (0.4%) speed improvement there, but although both 3-iteration runs >> produced stable results, that might still be noise. There was a slightly >> larger (non-noise) improvement for a 256-bit SVE model. >> >> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but >> without any noticeable improvement in performance. No other test grew >> by more than 2%. >> >> Although the main SPEC beneficiaries are all Fortran tests, the >> benchmarks we use for SVE also include some C and C++ tests that >> benefit. > > Did you see any slowdown, for example because versioning was forced > to be on an innermost loop? I'm thinking of the testcase in > PR87561 where we do have strided accesses in the innermost loop. Not so far. > Since you cite performance numbers how did you measure them? > I assume -Ofast -march=native but did you check with -flto? -O3 -march=native. No, didn't check LTO, will do that. >> Using -frepack-arrays gives the same benefits in many Fortran cases. >> The problem is that using that option inappropriately can force a full >> array copy for arguments that the function only reads once, and so it >> isn't really something we can turn on by default. The new pass is >> supposed to give most of the benefits of -frepack-arrays without >> the risk of unnecessary repacking. >> >> The patch therefore enables the pass by default at -O3. > > I think that's reasonable. 
> > One possible enhancement would be to add a value-profile for the > strides so we can guide this optimization better. > > The pass falls foul of C++ class make small methods of everything. > That makes following the code very hard. Please inline single-used > methods in callers wherever possible to make the code read > more like GCC code (using GCC API). I'll bite my tongue on this one :-) > The pass contains an awful lot of heuristics :/ Like last year > with the interchange pass I would suggest to rip most of it out > and first lay infrastructure with the cases you can positively > identify without applying heuristics or "hacks" like stripping > semantically required casts. That makes it also clear which > testcases test which code-path. That said, all the analyze > multiplications/plusses/factors stuff was extremely hard to review > and I have no overall picture why this is all so complicated or > necessary. I think the tests do cover most of this -- was glad to see that when Kyrill tried taking something out, one of the tests started to fail :-). And the comments in
Re: Add a loop versioning pass
David Malcolm writes: > On Wed, 2018-10-24 at 14:05 +0100, Richard Sandiford wrote: >> This patch adds a pass that versions loops with variable index >> strides >> for the case in which the stride is 1. E.g.: >> >> for (int i = 0; i < n; ++i) >> x[i * stride] = ...; >> >> becomes: >> >> if (stepx == 1) >> for (int i = 0; i < n; ++i) >> x[i] = ...; >> else >> for (int i = 0; i < n; ++i) >> x[i * stride] = ...; >> >> This is useful for both vector code and scalar code, and in some >> cases >> can enable further optimisations like loop interchange or pattern >> recognition. >> >> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 >> and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests >> that regress. >> >> Sizewise, there's a 10% increase in .text for both 554.roms_r and >> 465.tonto. That's obviously a lot, but in tonto's case it's because >> the whole program is written using assumed-shape arrays and pointers, >> so a large number of functions really do benefit from versioning. >> roms likewise makes heavy use of assumed-shape arrays, and that >> improvement in performance IMO justifies the code growth. >> >> The next biggest .text increase is 4.5% for 548.exchange2_r. I did >> see >> a small (0.4%) speed improvement there, but although both 3-iteration >> runs >> produced stable results, that might still be noise. There was a >> slightly >> larger (non-noise) improvement for a 256-bit SVE model. >> >> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but >> without any noticeable improvement in performance. No other test >> grew >> by more than 2%. >> >> Although the main SPEC beneficiaries are all Fortran tests, the >> benchmarks we use for SVE also include some C and C++ tests that >> benefit. >> >> Using -frepack-arrays gives the same benefits in many Fortran cases. 
>> The problem is that using that option inappropriately can force a full >> array copy for arguments that the function only reads once, and so it >> isn't really something we can turn on by default. The new pass is >> supposed to give most of the benefits of -frepack-arrays without >> the risk of unnecessary repacking. >> >> The patch therefore enables the pass by default at -O3. >> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install? >> >> Richard >

> [...snip...]

>> +/* Run the pass and return a set of TODO_* flags.  */
>> +
>> +unsigned int
>> +loop_versioning::run ()
>> +{
>> +  gcc_assert (scev_initialized_p ());
>> +
>> +  if (!analyze_blocks ()
>> +      || !prune_conditions ()
>> +      || !make_versioning_decisions ()
>> +      || !implement_versioning_decisions ())
>> +    return 0;
>> +
>> +  return TODO_update_ssa;
>> +}
>> +
>> +/* Loop versioningting pass.  */

> (typo)

Huh, no idea how I even got there.

>> +
>> +namespace {

> Could the whole file be within this anonymous namespace, rather than > just the opt_pass subclass? (hiding class loop_versioning, so that the > optimizer knows that the only thing visible outside the TU is > make_pass_loop_versioning). This can be a pain to debug, but you can > always comment out the anon namespace locally when debugging.

Yeah, I prefer that style too, so that the TU only exports public interfaces. I'd got the impression from earlier threads that this was frowned on for GCC and so I was deliberately avoiding it, but if it's OK then great. Thanks, Richard
Re: Add a loop versioning pass
(sorry for breaking threading -- I composed a review mail offline, but gmail has no way of nicely sending that, nor does it have a way to bounce messages...)

> This patch adds a pass that versions loops with variable index strides > for the case in which the stride is 1. E.g.:

>   for (int i = 0; i < n; ++i)
>     x[i * stride] = ...;
>
> becomes:
>
>   if (stepx == 1)
>     for (int i = 0; i < n; ++i)
>       x[i] = ...;
>   else
>     for (int i = 0; i < n; ++i)
>       x[i * stride] = ...;

> This is useful for both vector code and scalar code, and in some cases > can enable further optimisations like loop interchange or pattern > recognition. > > The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 > and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests > that regress. > > Sizewise, there's a 10% increase in .text for both 554.roms_r and > 465.tonto. That's obviously a lot, but in tonto's case it's because > the whole program is written using assumed-shape arrays and pointers, > so a large number of functions really do benefit from versioning. > roms likewise makes heavy use of assumed-shape arrays, and that > improvement in performance IMO justifies the code growth. Ouch. I know that at least with LTO IPA-CP can do "quite" some propagation of constant strides. Not sure if we're aggressive enough in actually doing the cloning for all cases we figure out strides though. But my question is how we can avoid doing the versioning for loops in the copy that did not have the IPA-CPed stride of one? Ideally we'd be able to mark individual references as {definitely,likely,unlikely,not}-unit-stride? > The next biggest .text increase is 4.5% for 548.exchange2_r. I did see > a small (0.4%) speed improvement there, but although both 3-iteration runs > produced stable results, that might still be noise. There was a slightly > larger (non-noise) improvement for a 256-bit SVE model. 
> > 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but > without any noticeable improvement in performance. No other test grew > by more than 2%. > > Although the main SPEC beneficiaries are all Fortran tests, the > benchmarks we use for SVE also include some C and C++ tests that > benefit. Did you see any slowdown, for example because versioning was forced to be on an innermost loop? I'm thinking of the testcase in PR87561 where we do have strided accesses in the innermost loop. Since you cite performance numbers how did you measure them? I assume -Ofast -march=native but did you check with -flto? > Using -frepack-arrays gives the same benefits in many Fortran cases. > The problem is that using that option inappropriately can force a full > array copy for arguments that the function only reads once, and so it > isn't really something we can turn on by default. The new pass is > supposed to give most of the benefits of -frepack-arrays without > the risk of unnecessary repacking. > > The patch therefore enables the pass by default at -O3. I think that's reasonable. One possible enhancement would be to add a value-profile for the strides so we can guide this optimization better. The pass falls foul of C++ class make small methods of everything. That makes following the code very hard. Please inline single-used methods in callers wherever possible to make the code read more like GCC code (using GCC API). The pass contains an awful lot of heuristics :/ Like last year with the interchange pass I would suggest to rip most of it out and first lay infrastructure with the cases you can positively identify without applying heuristics or "hacks" like stripping semantically required casts. That makes it also clear which testcases test which code-path. That said, all the analyze multiplications/plusses/factors stuff was extremely hard to review and I have no overall picture why this is all so complicated or necessary. 
> Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install? > > Richard > > > 2018-10-24 Richard Sandiford > > gcc/ > * doc/invoke.texi (-fversion-loops-for-strides): Document > (loop-versioning-group-size, loop-versioning-max-inner-insns) > (loop-versioning-max-outer-insns): Document new --params. > * Makefile.in (OBJS): Add gimple-loop-versioning.o. > * common.opt (fversion-loops-for-strides): New option. > * opts.c (default_options_table): Enable fversion-loops-for-strides > at -O3. > * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE) > (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS) > (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters. > * passes.def: Add pass_loop_versioning. > * timevar.def (TV_LOOP_VERSIONING): New time variable. > * tree-ssa-propagate.h > (substitute_and_fold_engine::substitute_and_fold): Add an optional > block parameter. > * tree-ssa-propagate.c > (substitute_and_fold_engine::substitute_and_fold):
Re: Add a loop versioning pass
On Thu, Oct 25, 2018 at 3:52 PM Richard Biener wrote: > > On Wed, Oct 24, 2018 at 3:05 PM Richard Sandiford > wrote: > > > > This patch adds a pass that versions loops with variable index strides > > for the case in which the stride is 1. E.g.: > > > > for (int i = 0; i < n; ++i) > > x[i * stride] = ...; > > > > becomes: > > > > if (stepx == 1) > > for (int i = 0; i < n; ++i) > > x[i] = ...; > > else > > for (int i = 0; i < n; ++i) > > x[i * stride] = ...; > > > > This is useful for both vector code and scalar code, and in some cases > > can enable further optimisations like loop interchange or pattern > > recognition. > > > > The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 > > and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests > > that regress. > > > > Sizewise, there's a 10% increase in .text for both 554.roms_r and > > 465.tonto. That's obviously a lot, but in tonto's case it's because > > the whole program is written using assumed-shape arrays and pointers, > > so a large number of functions really do benefit from versioning. > > roms likewise makes heavy use of assumed-shape arrays, and that > > improvement in performance IMO justifies the code growth. > > > > The next biggest .text increase is 4.5% for 548.exchange2_r. I did see > > a small (0.4%) speed improvement there, but although both 3-iteration runs > > produced stable results, that might still be noise. There was a slightly > > larger (non-noise) improvement for a 256-bit SVE model. > > > > 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but > > without any noticeable improvement in performance. No other test grew > > by more than 2%. > > > > Although the main SPEC beneficiaries are all Fortran tests, the > > benchmarks we use for SVE also include some C and C++ tests that > > benefit. > > > > Using -frepack-arrays gives the same benefits in many Fortran cases. 
> > The problem is that using that option inappropriately can force a full > > array copy for arguments that the function only reads once, and so it > > isn't really something we can turn on by default. The new pass is > > supposed to give most of the benefits of -frepack-arrays without > > the risk of unnecessary repacking. > > > > The patch therefore enables the pass by default at -O3. > > > > Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install? > > I will give this a thorough review tomorrow (sorry for the delay), So I didn't really finish but one thing I noticed is

> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, ";; Want to version loop %d (depth %d)"
> +	       " for when ", loop->num, loop_depth (loop));
> +      print_generic_expr (dump_file, name, TDF_SLIM);
> +      fprintf (dump_file, " == 1");

Since you are writing a new pass you want to use the new dump interface.

  if (dump_enabled_p ())
    dump_printf (MSG_NOTE, ";; Want to version loop %d (depth %d)"
		 " for when %E == 1", loop->num, loop_depth (loop), name);

... it's much nicer to be able to use %E/%G than separate calls for the tree parts. I also cut my overall comment part but not the incomplete set of individual comments/questions yet (hope to finish that on Monday): > Sizewise, there's a 10% increase in .text for both 554.roms_r and > 465.tonto. That's obviously a lot, but in tonto's case it's because > the whole program is written using assumed-shape arrays and pointers, > so a large number of functions really do benefit from versioning. > roms likewise makes heavy use of assumed-shape arrays, and that > improvement in performance IMO justifies the code growth. Ouch. I know that at least with LTO IPA-CP can do "quite" some propagation of constant strides. Not sure if we're aggressive enough in actually doing the cloning for all cases we figure out strides though. 
But my question is how we can avoid doing the versioning for loops in the copy that did not have the IPA-CPed stride of one? Ideally we'd be able to mark individual references as {definitely,likely,unlikely,not}-unit-stride? > The next biggest .text increase is 4.5% for 548.exchange2_r. I did see > a small (0.4%) speed improvement there, but although both 3-iteration runs > produced stable results, that might still be noise. There was a slightly > larger (non-noise) improvement for a 256-bit SVE model. > > 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but > without any noticeable improvement in performance. No other test grew > by more than 2%. > > Although the main SPEC beneficiaries are all Fortran tests, the > benchmarks we use for SVE also include some C and C++ tests that > benefit. Did you see any slowdown, for example because versioning was forced to be on an innermost loop? I'm thinking of the testcase in PR87561 where we do have strided accesses in the innermost loop. Since you cite performance numbers how did
Re: Add a loop versioning pass
On Wed, 2018-10-24 at 14:05 +0100, Richard Sandiford wrote: > This patch adds a pass that versions loops with variable index > strides > for the case in which the stride is 1. E.g.: > > for (int i = 0; i < n; ++i) > x[i * stride] = ...; > > becomes: > > if (stepx == 1) > for (int i = 0; i < n; ++i) > x[i] = ...; > else > for (int i = 0; i < n; ++i) > x[i * stride] = ...; > > This is useful for both vector code and scalar code, and in some > cases > can enable further optimisations like loop interchange or pattern > recognition. > > The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 > and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests > that regress. > > Sizewise, there's a 10% increase in .text for both 554.roms_r and > 465.tonto. That's obviously a lot, but in tonto's case it's because > the whole program is written using assumed-shape arrays and pointers, > so a large number of functions really do benefit from versioning. > roms likewise makes heavy use of assumed-shape arrays, and that > improvement in performance IMO justifies the code growth. > > The next biggest .text increase is 4.5% for 548.exchange2_r. I did > see > a small (0.4%) speed improvement there, but although both 3-iteration > runs > produced stable results, that might still be noise. There was a > slightly > larger (non-noise) improvement for a 256-bit SVE model. > > 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but > without any noticeable improvement in performance. No other test > grew > by more than 2%. > > Although the main SPEC beneficiaries are all Fortran tests, the > benchmarks we use for SVE also include some C and C++ tests that > benefit. > > Using -frepack-arrays gives the same benefits in many Fortran cases. > The problem is that using that option inappropriately can force a > full > array copy for arguments that the function only reads once, and so it > isn't really something we can turn on by default. 
The new pass is > supposed to give most of the benefits of -frepack-arrays without > the risk of unnecessary repacking. > > The patch therefore enables the pass by default at -O3. > > Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install? > > Richard >

[...snip...]

> +/* Run the pass and return a set of TODO_* flags.  */
> +
> +unsigned int
> +loop_versioning::run ()
> +{
> +  gcc_assert (scev_initialized_p ());
> +
> +  if (!analyze_blocks ()
> +      || !prune_conditions ()
> +      || !make_versioning_decisions ()
> +      || !implement_versioning_decisions ())
> +    return 0;
> +
> +  return TODO_update_ssa;
> +}
> +
> +/* Loop versioningting pass.  */

(typo)

> +
> +namespace {

Could the whole file be within this anonymous namespace, rather than just the opt_pass subclass? (hiding class loop_versioning, so that the optimizer knows that the only thing visible outside the TU is make_pass_loop_versioning). This can be a pain to debug, but you can always comment out the anon namespace locally when debugging.

> +
> +const pass_data pass_data_loop_versioning =
> +{
> +  GIMPLE_PASS, /* type */
> +  "lversion", /* name */
> +  OPTGROUP_LOOP, /* optinfo_flags */
> +  TV_LOOP_VERSIONING, /* tv_id */
> +  PROP_cfg, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};

[...snip...]

> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_loop_versioning (gcc::context *ctxt)
> +{
> +  return new pass_loop_versioning (ctxt);
> +}

[...snip...]
Re: Add a loop versioning pass
Richard Biener writes: > On Wed, Oct 24, 2018 at 3:05 PM Richard Sandiford > wrote: >> >> This patch adds a pass that versions loops with variable index strides >> for the case in which the stride is 1. E.g.: >> >> for (int i = 0; i < n; ++i) >> x[i * stride] = ...; >> >> becomes: >> >> if (stepx == 1) >> for (int i = 0; i < n; ++i) >> x[i] = ...; >> else >> for (int i = 0; i < n; ++i) >> x[i * stride] = ...; >> >> This is useful for both vector code and scalar code, and in some cases >> can enable further optimisations like loop interchange or pattern >> recognition. >> >> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 >> and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests >> that regress. >> >> Sizewise, there's a 10% increase in .text for both 554.roms_r and >> 465.tonto. That's obviously a lot, but in tonto's case it's because >> the whole program is written using assumed-shape arrays and pointers, >> so a large number of functions really do benefit from versioning. >> roms likewise makes heavy use of assumed-shape arrays, and that >> improvement in performance IMO justifies the code growth. >> >> The next biggest .text increase is 4.5% for 548.exchange2_r. I did see >> a small (0.4%) speed improvement there, but although both 3-iteration runs >> produced stable results, that might still be noise. There was a slightly >> larger (non-noise) improvement for a 256-bit SVE model. >> >> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but >> without any noticeable improvement in performance. No other test grew >> by more than 2%. >> >> Although the main SPEC beneficiaries are all Fortran tests, the >> benchmarks we use for SVE also include some C and C++ tests that >> benefit. >> >> Using -frepack-arrays gives the same benefits in many Fortran cases. 
>> The problem is that using that option inappropriately can force a full >> array copy for arguments that the function only reads once, and so it >> isn't really something we can turn on by default. The new pass is >> supposed to give most of the benefits of -frepack-arrays without >> the risk of unnecessary repacking. >> >> The patch therefore enables the pass by default at -O3. >> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install? > > I will give this a thorough review tomorror (sorry for the delay), but > one question that comes up is how we avoid excessive versioning. > Consider this pass versioning for stride1, then loop distribution > versioning for aliasing and then vectorization versioning for alignment. Yeah, this could happen. :-) > Do we have any idea on how to track these versionings so we can > eventually commonize paths that didn't get any followup optimization > applied? The idea was that this pass would help even for scalar code (due to cheaper addressing), so it's not something we should necessarily undo if there is no follow-on optimisation. > Also consider the case where loop distribution applies transforms > with versioning for aliasing to _both_ loop nests from your versioning. I think that's a good thing though. The aliasing checks are going to be cheaper for a stride of 1 compared to a variable stride. > Maybe we can at least add some loop->number_of_times_versioned > counter and some dumping/statistics to get an idea how big an > issue this might be? That might help. If we could be more confident that the "faster" loop versions are very likely to be used, we could throttle the optimisation of the unlikely versions. But so far I haven't seen an example where this really hurts. Thanks, Richard > > Thanks, > Richard. 
> >> Richard >> >> >> 2018-10-24 Richard Sandiford >> >> gcc/ >> * doc/invoke.texi (-fversion-loops-for-strides): Document >> (loop-versioning-group-size, loop-versioning-max-inner-insns) >> (loop-versioning-max-outer-insns): Document new --params. >> * Makefile.in (OBJS): Add gimple-loop-versioning.o. >> * common.opt (fversion-loops-for-strides): New option. >> * opts.c (default_options_table): Enable fversion-loops-for-strides >> at -O3. >> * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE) >> (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS) >> (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters. >> * passes.def: Add pass_loop_versioning. >> * timevar.def (TV_LOOP_VERSIONING): New time variable. >> * tree-ssa-propagate.h >> (substitute_and_fold_engine::substitute_and_fold): Add an optional >> block parameter. >> * tree-ssa-propagate.c >> (substitute_and_fold_engine::substitute_and_fold): Likewise. >> When passed, only walk blocks dominated by that block. >> * tree-vrp.h (range_includes_p): Declare. >> (range_includes_zero_p): Turn into an inline wrapper around >> range_includes_p. >> * tree-vrp.c (range_includes_p):
Re: Add a loop versioning pass
On Wed, Oct 24, 2018 at 3:05 PM Richard Sandiford wrote: > > This patch adds a pass that versions loops with variable index strides > for the case in which the stride is 1. E.g.: > > for (int i = 0; i < n; ++i) > x[i * stride] = ...; > > becomes: > > if (stepx == 1) > for (int i = 0; i < n; ++i) > x[i] = ...; > else > for (int i = 0; i < n; ++i) > x[i * stride] = ...; > > This is useful for both vector code and scalar code, and in some cases > can enable further optimisations like loop interchange or pattern > recognition. > > The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3 > and a 2.4% improvement for 465.tonto. I haven't found any SPEC tests > that regress. > > Sizewise, there's a 10% increase in .text for both 554.roms_r and > 465.tonto. That's obviously a lot, but in tonto's case it's because > the whole program is written using assumed-shape arrays and pointers, > so a large number of functions really do benefit from versioning. > roms likewise makes heavy use of assumed-shape arrays, and that > improvement in performance IMO justifies the code growth. > > The next biggest .text increase is 4.5% for 548.exchange2_r. I did see > a small (0.4%) speed improvement there, but although both 3-iteration runs > produced stable results, that might still be noise. There was a slightly > larger (non-noise) improvement for a 256-bit SVE model. > > 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but > without any noticeable improvement in performance. No other test grew > by more than 2%. > > Although the main SPEC beneficiaries are all Fortran tests, the > benchmarks we use for SVE also include some C and C++ tests that > benefit. > > Using -frepack-arrays gives the same benefits in many Fortran cases. > The problem is that using that option inappropriately can force a full > array copy for arguments that the function only reads once, and so it > isn't really something we can turn on by default. 
The new pass is > supposed to give most of the benefits of -frepack-arrays without > the risk of unnecessary repacking. > > The patch therefore enables the pass by default at -O3. > > Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install? I will give this a thorough review tomorrow (sorry for the delay), but one question that comes up is how we avoid excessive versioning. Consider this pass versioning for stride1, then loop distribution versioning for aliasing and then vectorization versioning for alignment. Do we have any idea on how to track these versionings so we can eventually commonize paths that didn't get any followup optimization applied? Also consider the case where loop distribution applies transforms with versioning for aliasing to _both_ loop nests from your versioning. Maybe we can at least add some loop->number_of_times_versioned counter and some dumping/statistics to get an idea how big an issue this might be? Thanks, Richard. > Richard > > > 2018-10-24 Richard Sandiford > > gcc/ > * doc/invoke.texi (-fversion-loops-for-strides): Document > (loop-versioning-group-size, loop-versioning-max-inner-insns) > (loop-versioning-max-outer-insns): Document new --params. > * Makefile.in (OBJS): Add gimple-loop-versioning.o. > * common.opt (fversion-loops-for-strides): New option. > * opts.c (default_options_table): Enable fversion-loops-for-strides > at -O3. > * params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE) > (PARAM_LOOP_VERSIONING_MAX_INNER_INSNS) > (PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters. > * passes.def: Add pass_loop_versioning. > * timevar.def (TV_LOOP_VERSIONING): New time variable. > * tree-ssa-propagate.h > (substitute_and_fold_engine::substitute_and_fold): Add an optional > block parameter. > * tree-ssa-propagate.c > (substitute_and_fold_engine::substitute_and_fold): Likewise. > When passed, only walk blocks dominated by that block. > * tree-vrp.h (range_includes_p): Declare. 
> (range_includes_zero_p): Turn into an inline wrapper around > range_includes_p. > * tree-vrp.c (range_includes_p): New function, generalizing... > (range_includes_zero_p): ...this. > * tree-pass.h (make_pass_loop_versioning): Declare. > * gimple-loop-versioning.cc: New file. > > gcc/testsuite/ > * gcc.dg/loop-versioning-1.c: New test. > * gcc.dg/loop-versioning-10.c: Likewise. > * gcc.dg/loop-versioning-11.c: Likewise. > * gcc.dg/loop-versioning-2.c: Likewise. > * gcc.dg/loop-versioning-3.c: Likewise. > * gcc.dg/loop-versioning-4.c: Likewise. > * gcc.dg/loop-versioning-5.c: Likewise. > * gcc.dg/loop-versioning-6.c: Likewise. > * gcc.dg/loop-versioning-7.c: Likewise. > * gcc.dg/loop-versioning-8.c: Likewise. > *