Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Wed, May 27, 2020 at 10:20:18AM +0200, Richard Biener wrote: > > How about "Var(flag_cunroll_grow_size) EnabledBy(funroll-loops || > > funroll-all-loops || fpeel-loops)" Or flag_cunroll_allow_grow_size? > > > > And then using this flags as: > > unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size > >|| optimize >= 3, true); > > > > And we do not need to enable this flag at -O2. > > Sure this works for me. Note I'd make funroll-loops enabled by > funroll-all-loops so you could simplify the above. But only do that on trunk? Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Richard Biener writes: > On Wed, May 27, 2020 at 6:36 AM Jiufu Guo wrote: >> >> Segher Boessenkool writes: >> >> > Hi! >> > >> > On Tue, May 26, 2020 at 08:58:13AM +0200, Richard Biener wrote: >> >> On Mon, May 25, 2020 at 7:44 PM Segher Boessenkool >> >> wrote: >> >> > Yes, cunroll does not have its own option, and that is a problem. But >> >> > that is easy to fix! Either with an option, or just with params (the >> >> > option wouldn't do more than set params anyway?) >> >> >> >> Well, given coming up with different names for essentially the same >> >> transform is going to be challenging how about sth like >> >> >> >> -funroll-loops={early,late,static,dynamic}[insert better names here] >> > >> > User interface is hard :-) I think luckily we don't need to change >> > anything there yet, just have an internal flag? >> > >> > But complete unrolling is something quite different, so it should have >> > its own flag anyway (at least internally). >> > >> >> note there's also -fpeel-loops which may match the transform >> >> done on GIMPLE better? >> > >> > Peeling and unrolling are the same thing, if doing complete unrolling >> > (or complete peeling), followed by DCE in both cases. Peeling is a >> > nicer name here I think, yeah. >> > >> >> I'm not sure which are the technically >> >> correct terms for unrollings that elide the loop (the backedge). >> > >> > I don't know a better term than "complete", I don't remember ever seeing >> > something else either. >> >> How about "Var(flag_cunroll_grow_size) EnabledBy(funroll-loops || >> funroll-all-loops || fpeel-loops)" Or flag_cunroll_allow_grow_size? >> >> And then using this flags as: >> unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size >>|| optimize >= 3, true); >> >> And we do not need to enable this flag at -O2. > > Sure this works for me. Note I'd make funroll-loops enabled by > funroll-all-loops so you could simplify the above. After some checking, I did not use 'EnabledBy(funroll-all-loops)' for funroll-loops, because of special behavior on -fno-unroll-all-loops. Command line -fno-unroll-all-loops turns off flag_unroll_loops, if flag_unroll_loops is enabled implicitly(not through command line). While the orignal logic is: only possitive flag_unroll_all_loops overrides flag_unroll_loops. "if (flag_unroll_all_loops) flag_unroll_loops = 1;" in process_options. Thanks. Jiufu > > Richard. > >> Thanks for all your helpful comments again! >> >> Jiufu >> >> > >> >> We're doing such kind of unrolling even if we cannot statically >> >> decide which of a set of possible exits we take (and internally >> >> call that peeling, if we can statically decide we call it complete >> >> unrolling). >> > >> > "Peeling" is placing some copies of the loop before the loop; >> > "unrolling" is placing a few copies of the loop inside the loop body. >> > Does that match usage here? >> > >> >> The RTL side OTOH only performs classical unrolling, >> >> preserving the backedge with various strategies for the >> >> remaining iterations. >> > >> > And if you do complete unrolling that way, the backedge can be removed, >> > since it can be shown never to be taken. >> > >> >> As said, for the regression on the 10 branch with ppc I'd add >> >> [a hidden] flag that controls the RTL unroller, also set by >> >> -funroll-loops and triggered by the ppc specific heuristics. >> > >> > But the problem is in cunroll? This is so backwards... Because some >> > other transform abuses the unroller flags, adding a second level flag >> > with the same meaning :-( It will work for fixing the regression, >> > sure, and it is slightly less code as well. >> > >> > >> > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Wed, May 27, 2020 at 6:36 AM Jiufu Guo wrote: > > Segher Boessenkool writes: > > > Hi! > > > > On Tue, May 26, 2020 at 08:58:13AM +0200, Richard Biener wrote: > >> On Mon, May 25, 2020 at 7:44 PM Segher Boessenkool > >> wrote: > >> > Yes, cunroll does not have its own option, and that is a problem. But > >> > that is easy to fix! Either with an option, or just with params (the > >> > option wouldn't do more than set params anyway?) > >> > >> Well, given coming up with different names for essentially the same > >> transform is going to be challenging how about sth like > >> > >> -funroll-loops={early,late,static,dynamic}[insert better names here] > > > > User interface is hard :-) I think luckily we don't need to change > > anything there yet, just have an internal flag? > > > > But complete unrolling is something quite different, so it should have > > its own flag anyway (at least internally). > > > >> note there's also -fpeel-loops which may match the transform > >> done on GIMPLE better? > > > > Peeling and unrolling are the same thing, if doing complete unrolling > > (or complete peeling), followed by DCE in both cases. Peeling is a > > nicer name here I think, yeah. > > > >> I'm not sure which are the technically > >> correct terms for unrollings that elide the loop (the backedge). > > > > I don't know a better term than "complete", I don't remember ever seeing > > something else either. > > How about "Var(flag_cunroll_grow_size) EnabledBy(funroll-loops || > funroll-all-loops || fpeel-loops)" Or flag_cunroll_allow_grow_size? > > And then using this flags as: > unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size >|| optimize >= 3, true); > > And we do not need to enable this flag at -O2. Sure this works for me. Note I'd make funroll-loops enabled by funroll-all-loops so you could simplify the above. Richard. > Thanks for all your helpful comments again! > > Jiufu > > > > >> We're doing such kind of unrolling even if we cannot statically > >> decide which of a set of possible exits we take (and internally > >> call that peeling, if we can statically decide we call it complete > >> unrolling). > > > > "Peeling" is placing some copies of the loop before the loop; > > "unrolling" is placing a few copies of the loop inside the loop body. > > Does that match usage here? > > > >> The RTL side OTOH only performs classical unrolling, > >> preserving the backedge with various strategies for the > >> remaining iterations. > > > > And if you do complete unrolling that way, the backedge can be removed, > > since it can be shown never to be taken. > > > >> As said, for the regression on the 10 branch with ppc I'd add > >> [a hidden] flag that controls the RTL unroller, also set by > >> -funroll-loops and triggered by the ppc specific heuristics. > > > > But the problem is in cunroll? This is so backwards... Because some > > other transform abuses the unroller flags, adding a second level flag > > with the same meaning :-( It will work for fixing the regression, > > sure, and it is slightly less code as well. > > > > > > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Segher Boessenkool writes: > Hi! > > On Tue, May 26, 2020 at 08:58:13AM +0200, Richard Biener wrote: >> On Mon, May 25, 2020 at 7:44 PM Segher Boessenkool >> wrote: >> > Yes, cunroll does not have its own option, and that is a problem. But >> > that is easy to fix! Either with an option, or just with params (the >> > option wouldn't do more than set params anyway?) >> >> Well, given coming up with different names for essentially the same >> transform is going to be challenging how about sth like >> >> -funroll-loops={early,late,static,dynamic}[insert better names here] > > User interface is hard :-) I think luckily we don't need to change > anything there yet, just have an internal flag? > > But complete unrolling is something quite different, so it should have > its own flag anyway (at least internally). > >> note there's also -fpeel-loops which may match the transform >> done on GIMPLE better? > > Peeling and unrolling are the same thing, if doing complete unrolling > (or complete peeling), followed by DCE in both cases. Peeling is a > nicer name here I think, yeah. > >> I'm not sure which are the technically >> correct terms for unrollings that elide the loop (the backedge). > > I don't know a better term than "complete", I don't remember ever seeing > something else either. How about "Var(flag_cunroll_grow_size) EnabledBy(funroll-loops || funroll-all-loops || fpeel-loops)" Or flag_cunroll_allow_grow_size? And then using this flags as: unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size || optimize >= 3, true); And we do not need to enable this flag at -O2. Thanks for all your helpful comments again! Jiufu > >> We're doing such kind of unrolling even if we cannot statically >> decide which of a set of possible exits we take (and internally >> call that peeling, if we can statically decide we call it complete >> unrolling). > > "Peeling" is placing some copies of the loop before the loop; > "unrolling" is placing a few copies of the loop inside the loop body. > Does that match usage here? > >> The RTL side OTOH only performs classical unrolling, >> preserving the backedge with various strategies for the >> remaining iterations. > > And if you do complete unrolling that way, the backedge can be removed, > since it can be shown never to be taken. > >> As said, for the regression on the 10 branch with ppc I'd add >> [a hidden] flag that controls the RTL unroller, also set by >> -funroll-loops and triggered by the ppc specific heuristics. > > But the problem is in cunroll? This is so backwards... Because some > other transform abuses the unroller flags, adding a second level flag > with the same meaning :-( It will work for fixing the regression, > sure, and it is slightly less code as well. > > > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Hi! On Tue, May 26, 2020 at 08:58:13AM +0200, Richard Biener wrote: > On Mon, May 25, 2020 at 7:44 PM Segher Boessenkool > wrote: > > Yes, cunroll does not have its own option, and that is a problem. But > > that is easy to fix! Either with an option, or just with params (the > > option wouldn't do more than set params anyway?) > > Well, given coming up with different names for essentially the same > transform is going to be challenging how about sth like > > -funroll-loops={early,late,static,dynamic}[insert better names here] User interface is hard :-) I think luckily we don't need to change anything there yet, just have an internal flag? But complete unrolling is something quite different, so it should have its own flag anyway (at least internally). > note there's also -fpeel-loops which may match the transform > done on GIMPLE better? Peeling and unrolling are the same thing, if doing complete unrolling (or complete peeling), followed by DCE in both cases. Peeling is a nicer name here I think, yeah. > I'm not sure which are the technically > correct terms for unrollings that elide the loop (the backedge). I don't know a better term than "complete", I don't remember ever seeing something else either. > We're doing such kind of unrolling even if we cannot statically > decide which of a set of possible exits we take (and internally > call that peeling, if we can statically decide we call it complete > unrolling). "Peeling" is placing some copies of the loop before the loop; "unrolling" is placing a few copies of the loop inside the loop body. Does that match usage here? > The RTL side OTOH only performs classical unrolling, > preserving the backedge with various strategies for the > remaining iterations. And if you do complete unrolling that way, the backedge can be removed, since it can be shown never to be taken. > As said, for the regression on the 10 branch with ppc I'd add > [a hidden] flag that controls the RTL unroller, also set by > -funroll-loops and triggered by the ppc specific heuristics. But the problem is in cunroll? This is so backwards... Because some other transform abuses the unroller flags, adding a second level flag with the same meaning :-( It will work for fixing the regression, sure, and it is slightly less code as well. Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Richard Biener writes: > On Mon, May 25, 2020 at 7:44 PM Segher Boessenkool > wrote: >> >> On Mon, May 25, 2020 at 02:39:54PM +0200, Richard Biener wrote: >> > On Fri, May 22, 2020 at 6:54 PM Segher Boessenkool >> > wrote: >> > > > The split above allows the "bug" to be fixed (even on the branch) >> > > > without introducing even more target specialities. >> > > >> > > So does any split. Or I don't see what you mean? >> > >> > Well, a split that does not affect behavior for non-ppc architectures >> > when the flags by users are unchanged. Because that allows >> > the ppc regression to be fixed on the branch. >> > >> > Then, on trunk, we can think of a better overall flag design. Note >> >> Oh, as just a (very) temporary thing, it is fine of course (it should >> say it is then though). >> >> > that cunroll/cunrolli are not controlled by a flag currently, they >> > are gated on optimize >= [2|3] - it's just that either -funroll-loops >> > or -fpeel-loops causes its heuristics to allow code-size growth >> > by its own metrics according to the unroll --params. >> > >> > So it's a bit difficult to retrofit the heuristic behavior onto new >> > flags unless we want to completely move that over to a --param >> > that may be gets adjusted by -funroll-loops. >> >> Yes, cunroll does not have its own option, and that is a problem. But >> that is easy to fix! Either with an option, or just with params (the >> option wouldn't do more than set params anyway?) > > Well, given coming up with different names for essentially the same > transform is going to be challenging how about sth like > > -funroll-loops={early,late,static,dynamic}[insert better names here] > > note there's also -fpeel-loops which may match the transform > done on GIMPLE better? I'm not sure which are the technically > correct terms for unrollings that elide the loop (the backedge). > We're doing such kind of unrolling even if we cannot statically > decide which of a set of possible exits we take (and internally > call that peeling, if we can statically decide we call it complete > unrolling). The RTL side OTOH only performs classical unrolling, > preserving the backedge with various strategies for the > remaining iterations. > > As said, for the regression on the 10 branch with ppc I'd add > [a hidden] flag that controls the RTL unroller, also set by > -funroll-loops and triggered by the ppc specific heuristics. This way would enable rtl unroller at -O2 instead enable -funroll-loops, and then cunroll will not unroll/peel the loop if there is potential size increasing. This could avoid the negative affect on the loop which mentioned in PR95018 code. While, I still hope to tune cunroll at -O2 for ppc (or general platforms) with keeping possitive and avoid negative affects. Yes, this may align with what Honza plan to do. BR, Jiufu > > Richard. > >> >> Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Mon, May 25, 2020 at 7:44 PM Segher Boessenkool wrote: > > On Mon, May 25, 2020 at 02:39:54PM +0200, Richard Biener wrote: > > On Fri, May 22, 2020 at 6:54 PM Segher Boessenkool > > wrote: > > > > The split above allows the "bug" to be fixed (even on the branch) > > > > without introducing even more target specialities. > > > > > > So does any split. Or I don't see what you mean? > > > > Well, a split that does not affect behavior for non-ppc architectures > > when the flags by users are unchanged. Because that allows > > the ppc regression to be fixed on the branch. > > > > Then, on trunk, we can think of a better overall flag design. Note > > Oh, as just a (very) temporary thing, it is fine of course (it should > say it is then though). > > > that cunroll/cunrolli are not controlled by a flag currently, they > > are gated on optimize >= [2|3] - it's just that either -funroll-loops > > or -fpeel-loops causes its heuristics to allow code-size growth > > by its own metrics according to the unroll --params. > > > > So it's a bit difficult to retrofit the heuristic behavior onto new > > flags unless we want to completely move that over to a --param > > that may be gets adjusted by -funroll-loops. > > Yes, cunroll does not have its own option, and that is a problem. But > that is easy to fix! Either with an option, or just with params (the > option wouldn't do more than set params anyway?) Well, given coming up with different names for essentially the same transform is going to be challenging how about sth like -funroll-loops={early,late,static,dynamic}[insert better names here] note there's also -fpeel-loops which may match the transform done on GIMPLE better? I'm not sure which are the technically correct terms for unrollings that elide the loop (the backedge). We're doing such kind of unrolling even if we cannot statically decide which of a set of possible exits we take (and internally call that peeling, if we can statically decide we call it complete unrolling). The RTL side OTOH only performs classical unrolling, preserving the backedge with various strategies for the remaining iterations. As said, for the regression on the 10 branch with ppc I'd add [a hidden] flag that controls the RTL unroller, also set by -funroll-loops and triggered by the ppc specific heuristics. Richard. > > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Mon, May 25, 2020 at 02:39:54PM +0200, Richard Biener wrote: > On Fri, May 22, 2020 at 6:54 PM Segher Boessenkool > wrote: > > > The split above allows the "bug" to be fixed (even on the branch) > > > without introducing even more target specialities. > > > > So does any split. Or I don't see what you mean? > > Well, a split that does not affect behavior for non-ppc architectures > when the flags by users are unchanged. Because that allows > the ppc regression to be fixed on the branch. > > Then, on trunk, we can think of a better overall flag design. Note Oh, as just a (very) temporary thing, it is fine of course (it should say it is then though). > that cunroll/cunrolli are not controlled by a flag currently, they > are gated on optimize >= [2|3] - it's just that either -funroll-loops > or -fpeel-loops causes its heuristics to allow code-size growth > by its own metrics according to the unroll --params. > > So it's a bit difficult to retrofit the heuristic behavior onto new > flags unless we want to completely move that over to a --param > that may be gets adjusted by -funroll-loops. Yes, cunroll does not have its own option, and that is a problem. But that is easy to fix! Either with an option, or just with params (the option wouldn't do more than set params anyway?) Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Fri, May 22, 2020 at 6:54 PM Segher Boessenkool wrote: > > On Fri, May 22, 2020 at 01:22:10PM +0200, Richard Biener wrote: > > On Wed, May 20, 2020 at 10:37 PM Segher Boessenkool > > wrote: > > > > > > On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: > > > > I think this is the wrong way to approach this. You're doing too many > > > > things at once. Try to fix the powerpc regression with the extra > > > > flag_rtl_unroll_loops, that could be backported. Then you can > > > > independently see whether enabling more unrolling at -O2 makes > > > > sense. Because currently we _do_ unroll at -O2 when it does > > > > not increase size. Its just your patches make this as aggressive > > > > as -O3. > > > > > > Just do a separate flag (and option) for cunroll, instead? > > > > > > The RTL unroller is *the* unroller, and has been since forever. > > > > Sorry but that ship has sailed - I don't think we should make > > -O2 -funroll-loops no longer affect GIMPLE. > > Of course, and it would certainly be good if we could do some non- > trivial unrolling earlier. My point is, why rename all RTL stuff > now? That is just churn. > > (And we will need to keep unrolling in RTL for a long time still, if > not forever). > > > But sure, what I am proposing is have > > > > -frtl-unroll-loops > > -fgimple-unroll-loops > > > > with obvious semantics and -funroll-loops enabling both. > > Users should not really care about what is RTL or GIMPLE. > > Yeah -- so this shouldn't be part of the name at all. What you care > about are higher level characteristics of the unrolling, like > "complete" or "early" or whatever. > > > The split above allows the "bug" to be fixed (even on the branch) > > without introducing even more target specialities. > > So does any split. Or I don't see what you mean? Well, a split that does not affect behavior for non-ppc architectures when the flags by users are unchanged. Because that allows the ppc regression to be fixed on the branch. Then, on trunk, we can think of a better overall flag design. Note that cunroll/cunrolli are not controlled by a flag currently, they are gated on optimize >= [2|3] - it's just that either -funroll-loops or -fpeel-loops causes its heuristics to allow code-size growth by its own metrics according to the unroll --params. So it's a bit difficult to retrofit the heuristic behavior onto new flags unless we want to completely move that over to a --param that may be gets adjusted by -funroll-loops. Richard. > > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Fri, May 22, 2020 at 01:22:10PM +0200, Richard Biener wrote: > On Wed, May 20, 2020 at 10:37 PM Segher Boessenkool > wrote: > > > > On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: > > > I think this is the wrong way to approach this. You're doing too many > > > things at once. Try to fix the powerpc regression with the extra > > > flag_rtl_unroll_loops, that could be backported. Then you can > > > independently see whether enabling more unrolling at -O2 makes > > > sense. Because currently we _do_ unroll at -O2 when it does > > > not increase size. Its just your patches make this as aggressive > > > as -O3. > > > > Just do a separate flag (and option) for cunroll, instead? > > > > The RTL unroller is *the* unroller, and has been since forever. > > Sorry but that ship has sailed - I don't think we should make > -O2 -funroll-loops no longer affect GIMPLE. Of course, and it would certainly be good if we could do some non- trivial unrolling earlier. My point is, why rename all RTL stuff now? That is just churn. (And we will need to keep unrolling in RTL for a long time still, if not forever). > But sure, what I am proposing is have > > -frtl-unroll-loops > -fgimple-unroll-loops > > with obvious semantics and -funroll-loops enabling both. > Users should not really care about what is RTL or GIMPLE. Yeah -- so this shouldn't be part of the name at all. What you care about are higher level characteristics of the unrolling, like "complete" or "early" or whatever. > The split above allows the "bug" to be fixed (even on the branch) > without introducing even more target specialities. So does any split. Or I don't see what you mean? Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Wed, May 20, 2020 at 10:37 PM Segher Boessenkool wrote: > > On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: > > I think this is the wrong way to approach this. You're doing too many > > things at once. Try to fix the powerpc regression with the extra > > flag_rtl_unroll_loops, that could be backported. Then you can > > independently see whether enabling more unrolling at -O2 makes > > sense. Because currently we _do_ unroll at -O2 when it does > > not increase size. Its just your patches make this as aggressive > > as -O3. > > Just do a separate flag (and option) for cunroll, instead? > > The RTL unroller is *the* unroller, and has been since forever. Sorry but that ship has sailed - I don't think we should make -O2 -funroll-loops no longer affect GIMPLE. But sure, what I am proposing is have -frtl-unroll-loops -fgimple-unroll-loops with obvious semantics and -funroll-loops enabling both. Users should not really care about what is RTL or GIMPLE. The split above allows the "bug" to be fixed (even on the branch) without introducing even more target specialities. Richard. > > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Jiufu Guo writes: > Jan Hubicka writes: > >>> Segher Boessenkool writes: >>> >>> > On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: >>> >> I think this is the wrong way to approach this. You're doing too many >>> >> things at once. Try to fix the powerpc regression with the extra >>> >> flag_rtl_unroll_loops, that could be backported. Then you can >>> >>> Or flag_complete_unroll_loops(-fcomplete-unroll-loops) for GIMPLE >>> cunroll? >>> >> independently see whether enabling more unrolling at -O2 makes >>> >> sense. Because currently we _do_ unroll at -O2 when it does >>> >> not increase size. Its just your patches make this as aggressive >>> >> as -O3. >>> >>> I'm also thinking about enabling more cunroll at -O2 even with some size >>> increasing. Full cunroll enablement make it like -O3. As some >>> discussion in PRs (e.g. PR88760), small/simple loops unrolling may be in >>> favor of some platforms (but not for all platforms, like x86_64?). This >>> would make us to have target specified hook. Or do some generic >>> setting: accept to unroll/peel limit times if the loop body is simple >>> and small, together with target specific hook. >> >> We now have --params that can be tuned differently for -O2 and -O3 so One thing about tunning --param based on optimization level, some times difference function may has different optimization level. While --params may not be set per function, if so, --param may not work as expect at some functions. Not sure if this is an issue you may concern about. Thanks! Jiufu >> looking into cunroll was one of my todo for GCC 10 -O2 retuning but i did >> not get any very conclusive benchmark results outside SPEC. >> I planned to return to it next stage1, so it may be good time. >> Do you have any benchmarks on ppc? > > 541.leela_r, 548.exchange2_r and 557.xz_r from SPEC2017 are visbily > affected by cunroll. They can be used to tune cunroll, I think. > >> Of couse there is no need to keep same defaults for all targets, but in >> general having target specific defaults increases number of knobs we >> need to check and keep up to date. > > Thanks, > Jiufu > >> >> Honza > >>> >>> Any comments? Thanks! >>> Jiufu >>> >>> > >>> > Just do a separate flag (and option) for cunroll, instead? >>> > >>> > The RTL unroller is *the* unroller, and has been since forever. >>> > >>> > >>> > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Jan Hubicka writes: >> Segher Boessenkool writes: >> >> > On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: >> >> I think this is the wrong way to approach this. You're doing too many >> >> things at once. Try to fix the powerpc regression with the extra >> >> flag_rtl_unroll_loops, that could be backported. Then you can >> >> Or flag_complete_unroll_loops(-fcomplete-unroll-loops) for GIMPLE >> cunroll? >> >> independently see whether enabling more unrolling at -O2 makes >> >> sense. Because currently we _do_ unroll at -O2 when it does >> >> not increase size. Its just your patches make this as aggressive >> >> as -O3. >> >> I'm also thinking about enabling more cunroll at -O2 even with some size >> increasing. Full cunroll enablement make it like -O3. As some >> discussion in PRs (e.g. PR88760), small/simple loops unrolling may be in >> favor of some platforms (but not for all platforms, like x86_64?). This >> would make us to have target specified hook. Or do some generic >> setting: accept to unroll/peel limit times if the loop body is simple >> and small, together with target specific hook. > > We now have --params that can be tuned differently for -O2 and -O3 so > looking into cunroll was one of my todo for GCC 10 -O2 retuning but i did > not get any very conclusive benchmark results outside SPEC. > I planned to return to it next stage1, so it may be good time. > Do you have any benchmarks on ppc? 541.leela_r, 548.exchange2_r and 557.xz_r from SPEC2017 are visbily affected by cunroll. They can be used to tune cunroll, I think. > Of couse there is no need to keep same defaults for all targets, but in > general having target specific defaults increases number of knobs we > need to check and keep up to date. Thanks, Jiufu > > Honza >> >> Any comments? Thanks! >> Jiufu >> >> > >> > Just do a separate flag (and option) for cunroll, instead? >> > >> > The RTL unroller is *the* unroller, and has been since forever. >> > >> > >> > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
> Segher Boessenkool writes: > > > On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: > >> I think this is the wrong way to approach this. You're doing too many > >> things at once. Try to fix the powerpc regression with the extra > >> flag_rtl_unroll_loops, that could be backported. Then you can > > Or flag_complete_unroll_loops(-fcomplete-unroll-loops) for GIMPLE > cunroll? > >> independently see whether enabling more unrolling at -O2 makes > >> sense. Because currently we _do_ unroll at -O2 when it does > >> not increase size. Its just your patches make this as aggressive > >> as -O3. > > I'm also thinking about enabling more cunroll at -O2 even with some size > increasing. Full cunroll enablement make it like -O3. As some > discussion in PRs (e.g. PR88760), small/simple loops unrolling may be in > favor of some platforms (but not for all platforms, like x86_64?). This > would make us to have target specified hook. Or do some generic > setting: accept to unroll/peel limit times if the loop body is simple > and small, together with target specific hook. We now have --params that can be tuned differently for -O2 and -O3 so looking into cunroll was one of my todo for GCC 10 -O2 retuning but i did not get any very conclusive benchmark results outside SPEC. I planned to return to it next stage1, so it may be good time. Do you have any benchmarks on ppc? Of couse there is no need to keep same defaults for all targets, but in general having target specific defaults increases number of knobs we need to check and keep up to date. Honza > > Any comments? Thanks! > Jiufu > > > > > Just do a separate flag (and option) for cunroll, instead? > > > > The RTL unroller is *the* unroller, and has been since forever. > > > > > > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Segher Boessenkool writes: > On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: >> I think this is the wrong way to approach this. You're doing too many >> things at once. Try to fix the powerpc regression with the extra >> flag_rtl_unroll_loops, that could be backported. Then you can Or flag_complete_unroll_loops(-fcomplete-unroll-loops) for GIMPLE cunroll? >> independently see whether enabling more unrolling at -O2 makes >> sense. Because currently we _do_ unroll at -O2 when it does >> not increase size. Its just your patches make this as aggressive >> as -O3. I'm also thinking about enabling more cunroll at -O2 even with some size increasing. Full cunroll enablement make it like -O3. As some discussion in PRs (e.g. PR88760), small/simple loops unrolling may be in favor of some platforms (but not for all platforms, like x86_64?). This would make us to have target specified hook. Or do some generic setting: accept to unroll/peel limit times if the loop body is simple and small, together with target specific hook. Any comments? Thanks! Jiufu > > Just do a separate flag (and option) for cunroll, instead? > > The RTL unroller is *the* unroller, and has been since forever. > > > Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Wed, May 20, 2020 at 12:30:30PM +0200, Richard Biener wrote: > I think this is the wrong way to approach this. You're doing too many > things at once. Try to fix the powerpc regression with the extra > flag_rtl_unroll_loops, that could be backported. Then you can > independently see whether enabling more unrolling at -O2 makes > sense. Because currently we _do_ unroll at -O2 when it does > not increase size. Its just your patches make this as aggressive > as -O3. Just do a separate flag (and option) for cunroll, instead? The RTL unroller is *the* unroller, and has been since forever. Segher
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Wed, May 20, 2020 at 11:08 AM Jiufu Guo wrote: > > Richard Biener writes: > > > On Wed, May 20, 2020 at 5:56 AM Jiufu Guo via Gcc-patches > > wrote: > >> > >> Hi, > >> > >> In r10-4525, and r10-4161, loop unroller was enabled for simply loops at > >> -O2. > >> At the same time, the GIMPLE cunroll is also enabled, while it is not only > >> for > >> simple loops. This patch introduces a hook to check if a loop is suitable > >> to > >> unroll completely. The hook can be used to check if a loop is simply > >> enough > >> for complete unroll, then avoid negative effects like size increasing or > >> complex loop unrolling at -O2. > >> > >> This patch could help to handle the code in PR95018 which contains complex > >> loop, and does not cause obvious performance change on SPEC2017. > >> > >> Bootstrap/regtest pass on powerpc64le. OK for trunk? > > > > I fear you're just digging the rs6000 specific hole you dug up deeper > > with this ... > > I'm also thinking to add a check to cunroll directly without digging to > platforms. As patch below. This may not harm for platforms. > Is this acceptable? > > Thanks, > Jiufu > > diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c > index 8ab6ab3330c..67335b8a911 100644 > --- a/gcc/tree-ssa-loop-ivcanon.c > +++ b/gcc/tree-ssa-loop-ivcanon.c > @@ -1231,6 +1231,11 @@ canonicalize_loop_induction_variables (class loop > *loop, >/* Force re-computation of loop bounds so we can remove redundant exits. > */ >maxiter = max_loop_iterations_int (loop); > > + /* For optimize level 2, unroll and peel only for simple loops. */ > + if (allow_peel && optimize == 2 && maxiter >= 4 > + && !(TREE_CODE (niter) == INTEGER_CST && exit)) > +return false; > + I think this is the wrong way to approach this. You're doing too many things at once. Try to fix the powerpc regression with the extra flag_rtl_unroll_loops, that could be backported. Then you can independently see whether enabling more unrolling at -O2 makes sense. Because currently we _do_ unroll at -O2 when it does not increase size. Its just your patches make this as aggressive as -O3. Richard. >if (dump_file && (dump_flags & TDF_DETAILS) >&& TREE_CODE (niter) == INTEGER_CST) > { > > > > > > Richard. > > > >> Jiufu > >> > >> 2020-05-20 Jiufu Guo > >> > >> PR target/95018 > >> * target.def (loop_allow_unroll_completely_peel): New hook. > >> * doc/tm.texi (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. > >> * doc/tm.texi.in (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New > >> hook. > >> * config/rs6000/rs6000.c > >> (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. > >> (rs6000_loop_allow_unroll_completely_peel): New hook > >> implementation. > >> * tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables): > >> Call > >> hook loop_allow_unroll_completely_peel. > >> > >> > >> --- > >> gcc/config/rs6000/rs6000.c | 19 +++ > >> gcc/doc/tm.texi | 9 + > >> gcc/doc/tm.texi.in | 2 ++ > >> gcc/target.def | 12 > >> gcc/tree-ssa-loop-ivcanon.c | 5 + > >> 5 files changed, 47 insertions(+) > >> > >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > >> index 8435bc15d72..a1a3f9cb583 100644 > >> --- a/gcc/config/rs6000/rs6000.c > >> +++ b/gcc/config/rs6000/rs6000.c > >> @@ -1433,6 +1433,9 @@ static const struct attribute_spec > >> rs6000_attribute_table[] = > >> #undef TARGET_LOOP_UNROLL_ADJUST > >> #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust > >> > >> +#undef TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > >> +#define TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > >> rs6000_loop_allow_unroll_completely_peel > >> + > >> #undef TARGET_INIT_BUILTINS > >> #define TARGET_INIT_BUILTINS rs6000_init_builtins > >> #undef TARGET_BUILTIN_DECL > >> @@ -5139,6 +5142,22 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct > >> loop *loop) > >>return nunroll; > >> } > >> > >> +/* Implement targetm.loop_allow_unroll_completely_peel. */ > >> + > >> +static bool > >> +rs6000_loop_allow_unroll_completely_peel (HOST_WIDE_INT maxiter, tree > >> niter, > >> + struct loop *loop) > >> +{ > >> + if (unroll_only_small_loops && optimize == 2) > >> +{ > >> + if (maxiter >= 4 > >> + && !(TREE_CODE (niter) == INTEGER_CST && single_exit (loop))) > >> + return false; > >> +} > >> + > >> + return true; > >> +} > >> + > >> /* Handler for the Mathematical Acceleration Subsystem (mass) interface > >> to a > >> library with vectorized intrinsics. */ > >> > >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > >> index 6e7d9dc54a9..f7419872c2f 100644 > >> --- a/gcc/doc/tm.texi > >> +++ b/gcc/doc/tm.texi > >> @@ -11893,6 +11893,15 @@ is required only when the target has special > >> constraints like maximum > >> number of
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Richard Biener writes: > On Wed, May 20, 2020 at 5:56 AM Jiufu Guo via Gcc-patches > wrote: >> >> Hi, >> >> In r10-4525, and r10-4161, loop unroller was enabled for simply loops at -O2. >> At the same time, the GIMPLE cunroll is also enabled, while it is not only >> for >> simple loops. This patch introduces a hook to check if a loop is suitable to >> unroll completely. The hook can be used to check if a loop is simply enough >> for complete unroll, then avoid negative effects like size increasing or >> complex loop unrolling at -O2. >> >> This patch could help to handle the code in PR95018 which contains complex >> loop, and does not cause obvious performance change on SPEC2017. >> >> Bootstrap/regtest pass on powerpc64le. OK for trunk? > > I fear you're just digging the rs6000 specific hole you dug up deeper > with this ... I'm also thinking to add a check to cunroll directly without digging to platforms. As patch below. This may not harm for platforms. Is this acceptable? Thanks, Jiufu diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c index 8ab6ab3330c..67335b8a911 100644 --- a/gcc/tree-ssa-loop-ivcanon.c +++ b/gcc/tree-ssa-loop-ivcanon.c @@ -1231,6 +1231,11 @@ canonicalize_loop_induction_variables (class loop *loop, /* Force re-computation of loop bounds so we can remove redundant exits. */ maxiter = max_loop_iterations_int (loop); + /* For optimize level 2, unroll and peel only for simple loops. */ + if (allow_peel && optimize == 2 && maxiter >= 4 + && !(TREE_CODE (niter) == INTEGER_CST && exit)) +return false; + if (dump_file && (dump_flags & TDF_DETAILS) && TREE_CODE (niter) == INTEGER_CST) { > > Richard. > >> Jiufu >> >> 2020-05-20 Jiufu Guo >> >> PR target/95018 >> * target.def (loop_allow_unroll_completely_peel): New hook. >> * doc/tm.texi (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. >> * doc/tm.texi.in (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New >> hook. >> * config/rs6000/rs6000.c (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): >> New hook. >> (rs6000_loop_allow_unroll_completely_peel): New hook implementation. >> * tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables): >> Call >> hook loop_allow_unroll_completely_peel. >> >> >> --- >> gcc/config/rs6000/rs6000.c | 19 +++ >> gcc/doc/tm.texi | 9 + >> gcc/doc/tm.texi.in | 2 ++ >> gcc/target.def | 12 >> gcc/tree-ssa-loop-ivcanon.c | 5 + >> 5 files changed, 47 insertions(+) >> >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c >> index 8435bc15d72..a1a3f9cb583 100644 >> --- a/gcc/config/rs6000/rs6000.c >> +++ b/gcc/config/rs6000/rs6000.c >> @@ -1433,6 +1433,9 @@ static const struct attribute_spec >> rs6000_attribute_table[] = >> #undef TARGET_LOOP_UNROLL_ADJUST >> #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust >> >> +#undef TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> +#define TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> rs6000_loop_allow_unroll_completely_peel >> + >> #undef TARGET_INIT_BUILTINS >> #define TARGET_INIT_BUILTINS rs6000_init_builtins >> #undef TARGET_BUILTIN_DECL >> @@ -5139,6 +5142,22 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct >> loop *loop) >>return nunroll; >> } >> >> +/* Implement targetm.loop_allow_unroll_completely_peel. */ >> + >> +static bool >> +rs6000_loop_allow_unroll_completely_peel (HOST_WIDE_INT maxiter, tree niter, >> + struct loop *loop) >> +{ >> + if (unroll_only_small_loops && optimize == 2) >> +{ >> + if (maxiter >= 4 >> + && !(TREE_CODE (niter) == INTEGER_CST && single_exit (loop))) >> + return false; >> +} >> + >> + return true; >> +} >> + >> /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a >> library with vectorized intrinsics. */ >> >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi >> index 6e7d9dc54a9..f7419872c2f 100644 >> --- a/gcc/doc/tm.texi >> +++ b/gcc/doc/tm.texi >> @@ -11893,6 +11893,15 @@ is required only when the target has special >> constraints like maximum >> number of memory accesses. >> @end deftypefn >> >> +@deftypefn {Target Hook} bool TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> (HOST_WIDE_INT @var{maxiter}, tree @var{niter}, class loop *@var{loop}) >> +This target hook check if the loop @var{loop} is suitable to unroll >> +completely on the target. The parameter @var{maxiter} is the possible max >> +bound of iterations. The parameter @var{niter} is the number expression of >> +iterations the loop is executed. The parameter @var{loop} is a pointer to >> +the loop. This target hook is required only when the target has special >> +constraints. >> +@end deftypefn >> + >> @defmac POWI_MAX_MULTS >> If defined, this macro is interpreted as a signed integer C expression >> that specifies the
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Richard Biener writes: > On Wed, May 20, 2020 at 10:27 AM Jiufu Guo wrote: >> >> Richard Biener writes: >> >> > On Wed, May 20, 2020 at 5:56 AM Jiufu Guo via Gcc-patches >> > wrote: >> >> >> >> Hi, >> >> >> >> In r10-4525, and r10-4161, loop unroller was enabled for simply loops at >> >> -O2. >> >> At the same time, the GIMPLE cunroll is also enabled, while it is not >> >> only for >> >> simple loops. This patch introduces a hook to check if a loop is >> >> suitable to >> >> unroll completely. The hook can be used to check if a loop is simply >> >> enough >> >> for complete unroll, then avoid negative effects like size increasing or >> >> complex loop unrolling at -O2. >> >> >> >> This patch could help to handle the code in PR95018 which contains complex >> >> loop, and does not cause obvious performance change on SPEC2017. >> >> >> >> Bootstrap/regtest pass on powerpc64le. OK for trunk? >> > >> > I fear you're just digging the rs6000 specific hole you dug up deeper >> > with this ... >> >> Yes, this is because r10-4525 and r10-4161 adjust RTL unroller hook for >> rs6000, and they have the effect to enable GIMPLE cunroll at O2 for all >> loops. So, I introduce this patch to adjust GIMPLE cunroll similarly. > > I think a better approach may be to leave GIMPLE unroll untouched by > changing the earlier patch to set a new flag_rtl_unroll_loops > (where -funroll-loops would enable -frtl-unroll-loops) and gate the RTL > unroller on flag_rtl_unroll_loops which you can enable separately > on rs6000. > > That is, not add even more rs6000 specialities but restrict the earlier > one to what it was supposed to do - the GIMPLE effects were clearly > not your goal. I also lean to enable GIMPLE cunroll for O2 (at that time only for rs6000), especially for those loops which are simple and not increase size too much. Thanks Richard. > > Richard. > >> Thanks, >> Jiufu >> >> > >> > Richard. >> > >> >> Jiufu >> >> >> >> 2020-05-20 Jiufu Guo >> >> >> >> PR target/95018 >> >> * target.def (loop_allow_unroll_completely_peel): New hook. >> >> * doc/tm.texi (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New >> >> hook. >> >> * doc/tm.texi.in (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New >> >> hook. >> >> * config/rs6000/rs6000.c >> >> (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. >> >> (rs6000_loop_allow_unroll_completely_peel): New hook >> >> implementation. >> >> * tree-ssa-loop-ivcanon.c >> >> (canonicalize_loop_induction_variables): Call >> >> hook loop_allow_unroll_completely_peel. >> >> >> >> >> >> --- >> >> gcc/config/rs6000/rs6000.c | 19 +++ >> >> gcc/doc/tm.texi | 9 + >> >> gcc/doc/tm.texi.in | 2 ++ >> >> gcc/target.def | 12 >> >> gcc/tree-ssa-loop-ivcanon.c | 5 + >> >> 5 files changed, 47 insertions(+) >> >> >> >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c >> >> index 8435bc15d72..a1a3f9cb583 100644 >> >> --- a/gcc/config/rs6000/rs6000.c >> >> +++ b/gcc/config/rs6000/rs6000.c >> >> @@ -1433,6 +1433,9 @@ static const struct attribute_spec >> >> rs6000_attribute_table[] = >> >> #undef TARGET_LOOP_UNROLL_ADJUST >> >> #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust >> >> >> >> +#undef TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> >> +#define TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> >> rs6000_loop_allow_unroll_completely_peel >> >> + >> >> #undef TARGET_INIT_BUILTINS >> >> #define TARGET_INIT_BUILTINS rs6000_init_builtins >> >> #undef TARGET_BUILTIN_DECL >> >> @@ -5139,6 +5142,22 @@ rs6000_loop_unroll_adjust (unsigned nunroll, >> >> struct loop *loop) >> >>return nunroll; >> >> } >> >> >> >> +/* Implement targetm.loop_allow_unroll_completely_peel. */ >> >> + >> >> +static bool >> >> +rs6000_loop_allow_unroll_completely_peel (HOST_WIDE_INT maxiter, tree >> >> niter, >> >> + struct loop *loop) >> >> +{ >> >> + if (unroll_only_small_loops && optimize == 2) >> >> +{ >> >> + if (maxiter >= 4 >> >> + && !(TREE_CODE (niter) == INTEGER_CST && single_exit (loop))) >> >> + return false; >> >> +} >> >> + >> >> + return true; >> >> +} >> >> + >> >> /* Handler for the Mathematical Acceleration Subsystem (mass) interface >> >> to a >> >> library with vectorized intrinsics. */ >> >> >> >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi >> >> index 6e7d9dc54a9..f7419872c2f 100644 >> >> --- a/gcc/doc/tm.texi >> >> +++ b/gcc/doc/tm.texi >> >> @@ -11893,6 +11893,15 @@ is required only when the target has special >> >> constraints like maximum >> >> number of memory accesses. >> >> @end deftypefn >> >> >> >> +@deftypefn {Target Hook} bool TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> >> (HOST_WIDE_INT @var{maxiter}, tree @var{niter}, class loop *@var{loop}) >> >> +This target hook check if the loop @var{loop} is suitable to unroll >> >>
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Wed, May 20, 2020 at 10:27 AM Jiufu Guo wrote: > > Richard Biener writes: > > > On Wed, May 20, 2020 at 5:56 AM Jiufu Guo via Gcc-patches > > wrote: > >> > >> Hi, > >> > >> In r10-4525, and r10-4161, loop unroller was enabled for simply loops at > >> -O2. > >> At the same time, the GIMPLE cunroll is also enabled, while it is not only > >> for > >> simple loops. This patch introduces a hook to check if a loop is suitable > >> to > >> unroll completely. The hook can be used to check if a loop is simply > >> enough > >> for complete unroll, then avoid negative effects like size increasing or > >> complex loop unrolling at -O2. > >> > >> This patch could help to handle the code in PR95018 which contains complex > >> loop, and does not cause obvious performance change on SPEC2017. > >> > >> Bootstrap/regtest pass on powerpc64le. OK for trunk? > > > > I fear you're just digging the rs6000 specific hole you dug up deeper > > with this ... > > Yes, this is because r10-4525 and r10-4161 adjust RTL unroller hook for > rs6000, and they have the effect to enable GIMPLE cunroll at O2 for all > loops. So, I introduce this patch to adjust GIMPLE cunroll similarly. I think a better approach may be to leave GIMPLE unroll untouched by changing the earlier patch to set a new flag_rtl_unroll_loops (where -funroll-loops would enable -frtl-unroll-loops) and gate the RTL unroller on flag_rtl_unroll_loops which you can enable separately on rs6000. That is, not add even more rs6000 specialities but restrict the earlier one to what it was supposed to do - the GIMPLE effects were clearly not your goal. Richard. > Thanks, > Jiufu > > > > > Richard. > > > >> Jiufu > >> > >> 2020-05-20 Jiufu Guo > >> > >> PR target/95018 > >> * target.def (loop_allow_unroll_completely_peel): New hook. > >> * doc/tm.texi (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. > >> * doc/tm.texi.in (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New > >> hook. > >> * config/rs6000/rs6000.c > >> (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. > >> (rs6000_loop_allow_unroll_completely_peel): New hook > >> implementation. > >> * tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables): > >> Call > >> hook loop_allow_unroll_completely_peel. > >> > >> > >> --- > >> gcc/config/rs6000/rs6000.c | 19 +++ > >> gcc/doc/tm.texi | 9 + > >> gcc/doc/tm.texi.in | 2 ++ > >> gcc/target.def | 12 > >> gcc/tree-ssa-loop-ivcanon.c | 5 + > >> 5 files changed, 47 insertions(+) > >> > >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > >> index 8435bc15d72..a1a3f9cb583 100644 > >> --- a/gcc/config/rs6000/rs6000.c > >> +++ b/gcc/config/rs6000/rs6000.c > >> @@ -1433,6 +1433,9 @@ static const struct attribute_spec > >> rs6000_attribute_table[] = > >> #undef TARGET_LOOP_UNROLL_ADJUST > >> #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust > >> > >> +#undef TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > >> +#define TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > >> rs6000_loop_allow_unroll_completely_peel > >> + > >> #undef TARGET_INIT_BUILTINS > >> #define TARGET_INIT_BUILTINS rs6000_init_builtins > >> #undef TARGET_BUILTIN_DECL > >> @@ -5139,6 +5142,22 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct > >> loop *loop) > >>return nunroll; > >> } > >> > >> +/* Implement targetm.loop_allow_unroll_completely_peel. */ > >> + > >> +static bool > >> +rs6000_loop_allow_unroll_completely_peel (HOST_WIDE_INT maxiter, tree > >> niter, > >> + struct loop *loop) > >> +{ > >> + if (unroll_only_small_loops && optimize == 2) > >> +{ > >> + if (maxiter >= 4 > >> + && !(TREE_CODE (niter) == INTEGER_CST && single_exit (loop))) > >> + return false; > >> +} > >> + > >> + return true; > >> +} > >> + > >> /* Handler for the Mathematical Acceleration Subsystem (mass) interface > >> to a > >> library with vectorized intrinsics. */ > >> > >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > >> index 6e7d9dc54a9..f7419872c2f 100644 > >> --- a/gcc/doc/tm.texi > >> +++ b/gcc/doc/tm.texi > >> @@ -11893,6 +11893,15 @@ is required only when the target has special > >> constraints like maximum > >> number of memory accesses. > >> @end deftypefn > >> > >> +@deftypefn {Target Hook} bool TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > >> (HOST_WIDE_INT @var{maxiter}, tree @var{niter}, class loop *@var{loop}) > >> +This target hook check if the loop @var{loop} is suitable to unroll > >> +completely on the target. The parameter @var{maxiter} is the possible max > >> +bound of iterations. The parameter @var{niter} is the number expression of > >> +iterations the loop is executed. The parameter @var{loop} is a pointer to > >> +the loop. This target hook is required only when the target has special > >> +constraints. > >> +@end
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Richard Biener writes: > On Wed, May 20, 2020 at 5:56 AM Jiufu Guo via Gcc-patches > wrote: >> >> Hi, >> >> In r10-4525, and r10-4161, loop unroller was enabled for simply loops at -O2. >> At the same time, the GIMPLE cunroll is also enabled, while it is not only >> for >> simple loops. This patch introduces a hook to check if a loop is suitable to >> unroll completely. The hook can be used to check if a loop is simply enough >> for complete unroll, then avoid negative effects like size increasing or >> complex loop unrolling at -O2. >> >> This patch could help to handle the code in PR95018 which contains complex >> loop, and does not cause obvious performance change on SPEC2017. >> >> Bootstrap/regtest pass on powerpc64le. OK for trunk? > > I fear you're just digging the rs6000 specific hole you dug up deeper > with this ... Yes, this is because r10-4525 and r10-4161 adjust RTL unroller hook for rs6000, and they have the effect to enable GIMPLE cunroll at O2 for all loops. So, I introduce this patch to adjust GIMPLE cunroll similarly. Thanks, Jiufu > > Richard. > >> Jiufu >> >> 2020-05-20 Jiufu Guo >> >> PR target/95018 >> * target.def (loop_allow_unroll_completely_peel): New hook. >> * doc/tm.texi (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. >> * doc/tm.texi.in (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New >> hook. >> * config/rs6000/rs6000.c (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): >> New hook. >> (rs6000_loop_allow_unroll_completely_peel): New hook implementation. >> * tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables): >> Call >> hook loop_allow_unroll_completely_peel. >> >> >> --- >> gcc/config/rs6000/rs6000.c | 19 +++ >> gcc/doc/tm.texi | 9 + >> gcc/doc/tm.texi.in | 2 ++ >> gcc/target.def | 12 >> gcc/tree-ssa-loop-ivcanon.c | 5 + >> 5 files changed, 47 insertions(+) >> >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c >> index 8435bc15d72..a1a3f9cb583 100644 >> --- a/gcc/config/rs6000/rs6000.c >> +++ b/gcc/config/rs6000/rs6000.c >> @@ -1433,6 +1433,9 @@ static const struct attribute_spec >> rs6000_attribute_table[] = >> #undef TARGET_LOOP_UNROLL_ADJUST >> #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust >> >> +#undef TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> +#define TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> rs6000_loop_allow_unroll_completely_peel >> + >> #undef TARGET_INIT_BUILTINS >> #define TARGET_INIT_BUILTINS rs6000_init_builtins >> #undef TARGET_BUILTIN_DECL >> @@ -5139,6 +5142,22 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct >> loop *loop) >>return nunroll; >> } >> >> +/* Implement targetm.loop_allow_unroll_completely_peel. */ >> + >> +static bool >> +rs6000_loop_allow_unroll_completely_peel (HOST_WIDE_INT maxiter, tree niter, >> + struct loop *loop) >> +{ >> + if (unroll_only_small_loops && optimize == 2) >> +{ >> + if (maxiter >= 4 >> + && !(TREE_CODE (niter) == INTEGER_CST && single_exit (loop))) >> + return false; >> +} >> + >> + return true; >> +} >> + >> /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a >> library with vectorized intrinsics. */ >> >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi >> index 6e7d9dc54a9..f7419872c2f 100644 >> --- a/gcc/doc/tm.texi >> +++ b/gcc/doc/tm.texi >> @@ -11893,6 +11893,15 @@ is required only when the target has special >> constraints like maximum >> number of memory accesses. >> @end deftypefn >> >> +@deftypefn {Target Hook} bool TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> (HOST_WIDE_INT @var{maxiter}, tree @var{niter}, class loop *@var{loop}) >> +This target hook check if the loop @var{loop} is suitable to unroll >> +completely on the target. The parameter @var{maxiter} is the possible max >> +bound of iterations. The parameter @var{niter} is the number expression of >> +iterations the loop is executed. The parameter @var{loop} is a pointer to >> +the loop. This target hook is required only when the target has special >> +constraints. >> +@end deftypefn >> + >> @defmac POWI_MAX_MULTS >> If defined, this macro is interpreted as a signed integer C expression >> that specifies the maximum number of floating point multiplications >> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in >> index 3be984bbd5c..4ae079a650c 100644 >> --- a/gcc/doc/tm.texi.in >> +++ b/gcc/doc/tm.texi.in >> @@ -8010,6 +8010,8 @@ lists. >> >> @hook TARGET_LOOP_UNROLL_ADJUST >> >> +@hook TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL >> + >> @defmac POWI_MAX_MULTS >> If defined, this macro is interpreted as a signed integer C expression >> that specifies the maximum number of floating point multiplications >> diff --git a/gcc/target.def b/gcc/target.def >> index 07059a87caf..3842ba611a2 100644 >> --- a/gcc/target.def >>
Re: [PATCH 1/2] rs6000: tune cunroll for simple loops at O2
On Wed, May 20, 2020 at 5:56 AM Jiufu Guo via Gcc-patches wrote: > > Hi, > > In r10-4525, and r10-4161, loop unroller was enabled for simply loops at -O2. > At the same time, the GIMPLE cunroll is also enabled, while it is not only for > simple loops. This patch introduces a hook to check if a loop is suitable to > unroll completely. The hook can be used to check if a loop is simply enough > for complete unroll, then avoid negative effects like size increasing or > complex loop unrolling at -O2. > > This patch could help to handle the code in PR95018 which contains complex > loop, and does not cause obvious performance change on SPEC2017. > > Bootstrap/regtest pass on powerpc64le. OK for trunk? I fear you're just digging the rs6000 specific hole you dug up deeper with this ... Richard. > Jiufu > > 2020-05-20 Jiufu Guo > > PR target/95018 > * target.def (loop_allow_unroll_completely_peel): New hook. > * doc/tm.texi (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. > * doc/tm.texi.in (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. > * config/rs6000/rs6000.c (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): > New hook. > (rs6000_loop_allow_unroll_completely_peel): New hook implementation. > * tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables): > Call > hook loop_allow_unroll_completely_peel. > > > --- > gcc/config/rs6000/rs6000.c | 19 +++ > gcc/doc/tm.texi | 9 + > gcc/doc/tm.texi.in | 2 ++ > gcc/target.def | 12 > gcc/tree-ssa-loop-ivcanon.c | 5 + > 5 files changed, 47 insertions(+) > > diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > index 8435bc15d72..a1a3f9cb583 100644 > --- a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c > @@ -1433,6 +1433,9 @@ static const struct attribute_spec > rs6000_attribute_table[] = > #undef TARGET_LOOP_UNROLL_ADJUST > #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust > > +#undef TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > +#define TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > rs6000_loop_allow_unroll_completely_peel > + > #undef TARGET_INIT_BUILTINS > #define TARGET_INIT_BUILTINS rs6000_init_builtins > #undef TARGET_BUILTIN_DECL > @@ -5139,6 +5142,22 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct > loop *loop) >return nunroll; > } > > +/* Implement targetm.loop_allow_unroll_completely_peel. */ > + > +static bool > +rs6000_loop_allow_unroll_completely_peel (HOST_WIDE_INT maxiter, tree niter, > + struct loop *loop) > +{ > + if (unroll_only_small_loops && optimize == 2) > +{ > + if (maxiter >= 4 > + && !(TREE_CODE (niter) == INTEGER_CST && single_exit (loop))) > + return false; > +} > + > + return true; > +} > + > /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a > library with vectorized intrinsics. */ > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi > index 6e7d9dc54a9..f7419872c2f 100644 > --- a/gcc/doc/tm.texi > +++ b/gcc/doc/tm.texi > @@ -11893,6 +11893,15 @@ is required only when the target has special > constraints like maximum > number of memory accesses. > @end deftypefn > > +@deftypefn {Target Hook} bool TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > (HOST_WIDE_INT @var{maxiter}, tree @var{niter}, class loop *@var{loop}) > +This target hook check if the loop @var{loop} is suitable to unroll > +completely on the target. The parameter @var{maxiter} is the possible max > +bound of iterations. The parameter @var{niter} is the number expression of > +iterations the loop is executed. The parameter @var{loop} is a pointer to > +the loop. This target hook is required only when the target has special > +constraints. > +@end deftypefn > + > @defmac POWI_MAX_MULTS > If defined, this macro is interpreted as a signed integer C expression > that specifies the maximum number of floating point multiplications > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in > index 3be984bbd5c..4ae079a650c 100644 > --- a/gcc/doc/tm.texi.in > +++ b/gcc/doc/tm.texi.in > @@ -8010,6 +8010,8 @@ lists. > > @hook TARGET_LOOP_UNROLL_ADJUST > > +@hook TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL > + > @defmac POWI_MAX_MULTS > If defined, this macro is interpreted as a signed integer C expression > that specifies the maximum number of floating point multiplications > diff --git a/gcc/target.def b/gcc/target.def > index 07059a87caf..3842ba611a2 100644 > --- a/gcc/target.def > +++ b/gcc/target.def > @@ -2709,6 +2709,18 @@ number of memory accesses.", > unsigned, (unsigned nunroll, class loop *loop), > NULL) > > +/* Return if a loop is suitable for complete unroll. */ > +DEFHOOK > +(loop_allow_unroll_completely_peel, > + "This target hook check if the loop @var{loop} is suitable to unroll\n\ > +completely on the target. The parameter @var{maxiter} is the possible
[PATCH 1/2] rs6000: tune cunroll for simple loops at O2
Hi, In r10-4525, and r10-4161, loop unroller was enabled for simply loops at -O2. At the same time, the GIMPLE cunroll is also enabled, while it is not only for simple loops. This patch introduces a hook to check if a loop is suitable to unroll completely. The hook can be used to check if a loop is simply enough for complete unroll, then avoid negative effects like size increasing or complex loop unrolling at -O2. This patch could help to handle the code in PR95018 which contains complex loop, and does not cause obvious performance change on SPEC2017. Bootstrap/regtest pass on powerpc64le. OK for trunk? Jiufu 2020-05-20 Jiufu Guo PR target/95018 * target.def (loop_allow_unroll_completely_peel): New hook. * doc/tm.texi (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. * doc/tm.texi.in (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. * config/rs6000/rs6000.c (TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL): New hook. (rs6000_loop_allow_unroll_completely_peel): New hook implementation. * tree-ssa-loop-ivcanon.c (canonicalize_loop_induction_variables): Call hook loop_allow_unroll_completely_peel. --- gcc/config/rs6000/rs6000.c | 19 +++ gcc/doc/tm.texi | 9 + gcc/doc/tm.texi.in | 2 ++ gcc/target.def | 12 gcc/tree-ssa-loop-ivcanon.c | 5 + 5 files changed, 47 insertions(+) diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 8435bc15d72..a1a3f9cb583 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -1433,6 +1433,9 @@ static const struct attribute_spec rs6000_attribute_table[] = #undef TARGET_LOOP_UNROLL_ADJUST #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust +#undef TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL +#define TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL rs6000_loop_allow_unroll_completely_peel + #undef TARGET_INIT_BUILTINS #define TARGET_INIT_BUILTINS rs6000_init_builtins #undef TARGET_BUILTIN_DECL @@ -5139,6 +5142,22 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop) return nunroll; } +/* Implement targetm.loop_allow_unroll_completely_peel. */ + +static bool +rs6000_loop_allow_unroll_completely_peel (HOST_WIDE_INT maxiter, tree niter, + struct loop *loop) +{ + if (unroll_only_small_loops && optimize == 2) +{ + if (maxiter >= 4 + && !(TREE_CODE (niter) == INTEGER_CST && single_exit (loop))) + return false; +} + + return true; +} + /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a library with vectorized intrinsics. */ diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 6e7d9dc54a9..f7419872c2f 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -11893,6 +11893,15 @@ is required only when the target has special constraints like maximum number of memory accesses. @end deftypefn +@deftypefn {Target Hook} bool TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL (HOST_WIDE_INT @var{maxiter}, tree @var{niter}, class loop *@var{loop}) +This target hook check if the loop @var{loop} is suitable to unroll +completely on the target. The parameter @var{maxiter} is the possible max +bound of iterations. The parameter @var{niter} is the number expression of +iterations the loop is executed. The parameter @var{loop} is a pointer to +the loop. This target hook is required only when the target has special +constraints. +@end deftypefn + @defmac POWI_MAX_MULTS If defined, this macro is interpreted as a signed integer C expression that specifies the maximum number of floating point multiplications diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 3be984bbd5c..4ae079a650c 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -8010,6 +8010,8 @@ lists. @hook TARGET_LOOP_UNROLL_ADJUST +@hook TARGET_LOOP_ALLOW_UNROLL_COMPLETELY_PEEL + @defmac POWI_MAX_MULTS If defined, this macro is interpreted as a signed integer C expression that specifies the maximum number of floating point multiplications diff --git a/gcc/target.def b/gcc/target.def index 07059a87caf..3842ba611a2 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -2709,6 +2709,18 @@ number of memory accesses.", unsigned, (unsigned nunroll, class loop *loop), NULL) +/* Return if a loop is suitable for complete unroll. */ +DEFHOOK +(loop_allow_unroll_completely_peel, + "This target hook check if the loop @var{loop} is suitable to unroll\n\ +completely on the target. The parameter @var{maxiter} is the possible max\n\ +bound of iterations. The parameter @var{niter} is the number expression of\n\ +iterations the loop is executed. The parameter @var{loop} is a pointer to\n\ +the loop. This target hook is required only when the target has special\n\ +constraints.", + bool, (HOST_WIDE_INT maxiter, tree niter, class loop *loop), + NULL) + /* True if X is a legitimate MODE-mode immediate operand. */