Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2021-01-21 Thread Jan Hubicka
> Hi All,
> 
> James and I have been investigating this regression and have tracked it down 
> to register allocation.
> 
> I have created a new PR with our findings
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 but unfortunately
> we don't know how to proceed.
> 
> This does seem like a genuine bug in RA.  It looks like some magic threshold 
> has been crossed, but we're having
> trouble determining what this magic number is.

Thank you for the analysis - it was on my TODO list for a very long
time, but the function is large.  I will read it carefully and let's see
if we can come up with something useful.

Honza
> 
> Any help is appreciated.
> 
> Thanks,
> Tamar
> 
> > -----Original Message-----
> > From: Xionghu Luo 
> > Sent: Friday, October 16, 2020 9:47 AM
> > To: Tamar Christina ; Martin Jambor
> > ; Richard Sandiford ;
> > luoxhu via Gcc-patches 
> > Cc: seg...@kernel.crashing.org; wschm...@linux.ibm.com;
> > li...@gcc.gnu.org; Jan Hubicka ; dje....@gmail.com
> > Subject: Re: [PATCH] ipa-inline: Improve growth accumulation for recursive
> > calls
> > 
> > 
> > 
> > On 2020/9/12 01:36, Tamar Christina wrote:
> > > Hi Martin,
> > >
> > >>
> > >> can you please confirm that the difference between these two is all
> > >> due to the last option -fno-inline-functions-called-once ?  Is LTO
> > >> necessary?  I.e., can you run the benchmark also built with the
> > >> branch compiler and -mcpu=native -Ofast -fomit-frame-pointer
> > >> -fno-inline-functions-called-once ?
> > >>
> > >
> > > Done, see below.
> > >
> > >>> +----------+------------------------------------------------+------+--+--+
> > >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |  |  |
> > >>> +----------+------------------------------------------------+------+--+--+
> > >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |  |  |
> > >>> +----------+------------------------------------------------+------+--+--+
> > >>
> > >>>
> > >>> (Hopefully the table shows up correct)
> > >>
> > >> it does show OK for me, thanks.
> > >>
> > >>>
> > >>> It looks like your patch definitely does improve the basic cases. So
> > >>> there's not much difference between lto and non-lto anymore and it's
> > >> much better than GCC 10. However it still contains the regression
> > >> introduced by Honza's changes.
> > >>
> > >> I assume these are rates, not times, so negative means bad.  But do I
> > >> understand it correctly that you're comparing against GCC 10 with the
> > >> two parameters set to rather special values?  Because your table
> > >> seems to indicate that even for you, the branch is faster than GCC 10
> > >> with just -mcpu=native -Ofast -fomit-frame-pointer.
> > >
> > > Yes these are indeed rates, and indeed I am comparing against the same
> > > options we used to get the fastest rates on before which is the two
> > > parameters and the inline flag.
> > >
> > >>
> > >> So is the problem that the best obtainable run-time, even with
> > >> obscure options, from the branch is slower than the best obtainable
> > >> run-time from GCC 10?
> > >>
> > >
> > > Yeah that's the problem, when we compare the two we're still behind.
> > >
> > > I've done the additional two runs
> > >
> > > +----------+-------+--------------+
> > > | Compiler | Flags | diff GCC 10  |
> > > +----------+-------+--------------+

RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2021-01-21 Thread Tamar Christina via Gcc-patches
Hi All,

James and I have been investigating this regression and have tracked it down to 
register allocation.

I have created a new PR with our findings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 but unfortunately
we don't know how to proceed.

This does seem like a genuine bug in RA.  It looks like some magic threshold 
has been crossed, but we're having
trouble determining what this magic number is.

Any help is appreciated.

Thanks,
Tamar

> -----Original Message-----
> From: Xionghu Luo 
> Sent: Friday, October 16, 2020 9:47 AM
> To: Tamar Christina ; Martin Jambor
> ; Richard Sandiford ;
> luoxhu via Gcc-patches 
> Cc: seg...@kernel.crashing.org; wschm...@linux.ibm.com;
> li...@gcc.gnu.org; Jan Hubicka ; dje....@gmail.com
> Subject: Re: [PATCH] ipa-inline: Improve growth accumulation for recursive
> calls
> 
> 
> 
> On 2020/9/12 01:36, Tamar Christina wrote:
> > Hi Martin,
> >
> >>
> >> can you please confirm that the difference between these two is all
> >> due to the last option -fno-inline-functions-called-once ?  Is LTO
> >> necessary?  I.e., can you run the benchmark also built with the
> >> branch compiler and -mcpu=native -Ofast -fomit-frame-pointer
> >> -fno-inline-functions-called-once ?
> >>
> >
> > Done, see below.
> >
> >>> +----------+------------------------------------------------+------+--+--+
> >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |  |  |
> >>> +----------+------------------------------------------------+------+--+--+
> >>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |  |  |
> >>> +----------+------------------------------------------------+------+--+--+
> >>
> >>>
> >>> (Hopefully the table shows up correct)
> >>
> >> it does show OK for me, thanks.
> >>
> >>>
> >>> It looks like your patch definitely does improve the basic cases. So
> >>> there's not much difference between lto and non-lto anymore and it's
> >> much better than GCC 10. However it still contains the regression
> >> introduced by Honza's changes.
> >>
> >> I assume these are rates, not times, so negative means bad.  But do I
> >> understand it correctly that you're comparing against GCC 10 with the
> >> two parameters set to rather special values?  Because your table
> >> seems to indicate that even for you, the branch is faster than GCC 10
> >> with just -mcpu=native -Ofast -fomit-frame-pointer.
> >
> > Yes these are indeed rates, and indeed I am comparing against the same
> > options we used to get the fastest rates on before which is the two
> > parameters and the inline flag.
> >
> >>
> >> So is the problem that the best obtainable run-time, even with
> >> obscure options, from the branch is slower than the best obtainable
> >> run-time from GCC 10?
> >>
> >
> > Yeah that's the problem, when we compare the two we're still behind.
> >
> > I've done the additional two runs
> >
> > +--+--
> +--+
> > | Compiler | Flags
> | diff GCC 10  |
> > +--+--
> +--+
> > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-
> eval-threshold=1 --param   ipa-cp-unit-growth=80 -fno-inline-functions-
> called-once |  |
> > +--+--
> +--+
> > | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer
> | -44% |
> > +--+--
> +--+
> > | GCC 10   | -mcpu=native -Ofast 

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-10-16 Thread Xionghu Luo via Gcc-patches



On 2020/9/12 01:36, Tamar Christina wrote:
> Hi Martin,
> 
>>
>> can you please confirm that the difference between these two is all due to
>> the last option -fno-inline-functions-called-once ?  Is LTO necessary?
>> I.e., can
>> you run the benchmark also built with the branch compiler and -mcpu=native
>> -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
>>
> 
> Done, see below.
> 
>>> +----------+------------------------------------------------+------+--+--+
>>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |  |  |
>>> +----------+------------------------------------------------+------+--+--+
>>> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |  |  |
>>> +----------+------------------------------------------------+------+--+--+
>>
>>>
>>> (Hopefully the table shows up correct)
>>
>> it does show OK for me, thanks.
>>
>>>
>>> It looks like your patch definitely does improve the basic cases. So
>>> there's not much difference between lto and non-lto anymore and it's
>> much better than GCC 10. However it still contains the regression introduced
>> by Honza's changes.
>>
>> I assume these are rates, not times, so negative means bad.  But do I
>> understand it correctly that you're comparing against GCC 10 with the two
>> parameters set to rather special values?  Because your table seems to
>> indicate that even for you, the branch is faster than GCC 10 with just
>> -mcpu=native -Ofast -fomit-frame-pointer.
> 
> Yes these are indeed rates, and indeed I am comparing against the same options
> we used to get the fastest rates on before which is the two parameters and
> the inline flag.
> 
>>
>> So is the problem that the best obtainable run-time, even with obscure
>> options, from the branch is slower than the best obtainable run-time from
>> GCC 10?
>>
> 
> Yeah that's the problem, when we compare the two we're still behind.
> 
> I've done the additional two runs
> 
> +----------+-------+--------------+
> | Compiler | Flags | diff GCC 10  |
> +----------+-------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |      |
> +----------+-------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer | -44% |
> +----------+-------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% |
> +----------+-------+--------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
> +----------+-------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% |
> +----------+-------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
> +----------+-------+--------------+
> 

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-09-16 Thread Hongyu Wang via Gcc-patches
Tamar Christina  于2020年9月12日周六 上午1:39写道:

> Hi Martin,
>
> >
> > can you please confirm that the difference between these two is all due
> to
> > the last option -fno-inline-functions-called-once ?  Is LTO necessary?
> I.e., can
> > you run the benchmark also built with the branch compiler and
> -mcpu=native
> > -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
> >
>
> Done, see below.
>
> > >
> > > +----------+------------------------------------------------+------+--+--+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |  |  |
> > > +----------+------------------------------------------------+------+--+--+
> > > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |  |  |
> > > +----------+------------------------------------------------+------+--+--+
> >
> > >
> > > (Hopefully the table shows up correct)
> >
> > it does show OK for me, thanks.
> >
> > >
> > > It looks like your patch definitely does improve the basic cases. So
> > > there's not much difference between lto and non-lto anymore and it's
> > much better than GCC 10. However it still contains the regression
> introduced
> > by Honza's changes.
> >
> > I assume these are rates, not times, so negative means bad.  But do I
> > understand it correctly that you're comparing against GCC 10 with the two
> > parameters set to rather special values?  Because your table seems to
> > indicate that even for you, the branch is faster than GCC 10 with just
> > -mcpu=native -Ofast -fomit-frame-pointer.
>
> Yes these are indeed rates, and indeed I am comparing against the same
> options
> we used to get the fastest rates on before which is the two parameters and
> the inline flag.
>
> >
> > So is the problem that the best obtainable run-time, even with obscure
> > options, from the branch is slower than the best obtainable run-time from
> > GCC 10?
> >
>
> Yeah that's the problem, when we compare the two we're still behind.
>
> I've done the additional two runs
>
>
> +----------+-------+--------------+
> | Compiler | Flags | diff GCC 10  |
> +----------+-------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |      |
> +----------+-------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer | -44% |
> +----------+-------+--------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% |
> +----------+-------+--------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
> +----------+-------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% |
> +----------+-------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
> +----------+-------+--------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
> +----------+-------+--------------+
> | Branch   

RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-09-11 Thread Tamar Christina
Hi Martin,

> 
> can you please confirm that the difference between these two is all due to
> the last option -fno-inline-functions-called-once ?  Is LTO necessary?  I.e.,
> can
> you run the benchmark also built with the branch compiler and -mcpu=native
> -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?
> 

Done, see below.

> > +----------+------------------------------------------------+------+--+--+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |  |  |
> > +----------+------------------------------------------------+------+--+--+
> > | Branch   | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |  |  |
> > +----------+------------------------------------------------+------+--+--+
> 
> >
> > (Hopefully the table shows up correct)
> 
> it does show OK for me, thanks.
> 
> >
> > It looks like your patch definitely does improve the basic cases. So
> > there's not much difference between lto and non-lto anymore and it's
> much better than GCC 10. However it still contains the regression introduced
> by Honza's changes.
> 
> I assume these are rates, not times, so negative means bad.  But do I
> understand it correctly that you're comparing against GCC 10 with the two
> parameters set to rather special values?  Because your table seems to
> indicate that even for you, the branch is faster than GCC 10 with just
> -mcpu=native -Ofast -fomit-frame-pointer.

Yes these are indeed rates, and indeed I am comparing against the same options
we used to get the fastest rates on before which is the two parameters and
the inline flag.

> 
> So is the problem that the best obtainable run-time, even with obscure
> options, from the branch is slower than the best obtainable run-time from
> GCC 10?
> 

Yeah that's the problem, when we compare the two we're still behind.

I've done the additional two runs

+----------+-------+--------------+
| Compiler | Flags | diff GCC 10  |
+----------+-------+--------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |      |
+----------+-------+--------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer | -44% |
+----------+-------+--------------+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% |
+----------+-------+--------------+
| GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
+----------+-------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% |
+----------+-------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |
+----------+-------+--------------+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
+----------+-------+--------------+

RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-09-11 Thread Martin Jambor
Hi,

On Fri, Sep 11 2020, Tamar Christina wrote:
> Hi Martin,
>
>> On Fri, Aug 21 2020, Tamar Christina wrote:
>> >>
>> >> Honza's changes have been motivated to a big extent as an enabler for
>> >> IPA-CP heuristics changes to actually speed up 548.exchange2_r.
>> >>
>> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds
>> two
>> >> weeks ago, this week it is 403, but with my WIP (and so far untested)
>> >> patch below it is just 276 seconds - faster than one built with GCC 8
>> >> which needs
>> >> 283 seconds.
>> >>
>> >> I'll be interested in knowing if it also works this well on other 
>> >> architectures.
>> >>
>> 
>> I have posted the new version of the patch series to the mailing list
>> yesterday and I have also pushed the branch to the FSF repo as
>> refs/users/jamborm/heads/ipa-context_and_exchange-200907
>> 
>
> Thanks! I've pushed it through our CI with a host of different options (see 
> below)
>
>> >
>> > Many thanks for working on this!
>> >
>> > I tried this on an AArch64 Neoverse-N1 machine and didn't see any
>> difference.
>> > Do I need any flags for it to work? The patch was applied on top of
>> > 656218ab982cc22b826227045826c92743143af1
>> >
>> 
>> I only have access to fairly old AMD (Seattle) Opteron 1100 which might not
>> support some interesting Aarch64 ISA extensions but I can measure a
>> significant speedup on it (everything with just -Ofast -march=native -
>> mtune=native, no non-default parameters, without LTO, without any inlining
>> options):
>> 
>>   GCC 10 branch:  915 seconds
>>   Master (rev. 995bb851ffe):  989 seconds
>>   My branch:  827 seconds
>> 
>> (All numbers are 548.exchange2_r reference run times.)
>> 
>
> +----------+-------+--------------+--+--+
> | Compiler | Flags | diff GCC 10  |  |  |
> +----------+-------+--------------+--+--+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |      |  |  |
> +----------+-------+--------------+--+--+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer | -44% |  |  |
> +----------+-------+--------------+--+--+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% |  |  |
> +----------+-------+--------------+--+--+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |  |  |
> +----------+-------+--------------+--+--+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% |  |  |
> +----------+-------+--------------+--+--+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |  |  |

can you please confirm that the difference between these two is all due
to the last option -fno-inline-functions-called-once ?  Is LTO
necessary?  I.e., can you run the benchmark also built with the branch
compiler and
-mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once ?

> +----------+-------+--------------+--+--+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |  |  |
> +----------+-------+--------------+--+--+
> 

RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-09-11 Thread Tamar Christina
Hi Martin,

> -----Original Message-----
> From: Martin Jambor 
> Sent: Tuesday, September 8, 2020 3:01 PM
> To: Tamar Christina ; Richard Sandiford
> ; luoxhu via Gcc-patches  patc...@gcc.gnu.org>
> Cc: seg...@kernel.crashing.org; luoxhu ;
> wschm...@linux.ibm.com; li...@gcc.gnu.org; Jan Hubicka
> ; dje....@gmail.com
> Subject: RE: [PATCH] ipa-inline: Improve growth accumulation for recursive
> calls
> 
> Hi,
> 
> On Fri, Aug 21 2020, Tamar Christina wrote:
> >>
> >> Honza's changes have been motivated to a big extent as an enabler for
> >> IPA-CP heuristics changes to actually speed up 548.exchange2_r.
> >>
> >> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds
> two
> >> weeks ago, this week it is 403, but with my WIP (and so far untested)
> >> patch below it is just 276 seconds - faster than one built with GCC 8
> >> which needs
> >> 283 seconds.
> >>
> >> I'll be interested in knowing if it also works this well on other 
> >> architectures.
> >>
> 
> I have posted the new version of the patch series to the mailing list
> yesterday and I have also pushed the branch to the FSF repo as
> refs/users/jamborm/heads/ipa-context_and_exchange-200907
> 

Thanks! I've pushed it through our CI with a host of different options (see 
below)

> >
> > Many thanks for working on this!
> >
> > I tried this on an AArch64 Neoverse-N1 machine and didn't see any
> difference.
> > Do I need any flags for it to work? The patch was applied on top of
> > 656218ab982cc22b826227045826c92743143af1
> >
> 
> I only have access to fairly old AMD (Seattle) Opteron 1100 which might not
> support some interesting Aarch64 ISA extensions but I can measure a
> significant speedup on it (everything with just -Ofast -march=native -
> mtune=native, no non-default parameters, without LTO, without any inlining
> options):
> 
>   GCC 10 branch:  915 seconds
>   Master (rev. 995bb851ffe):  989 seconds
>   My branch:  827 seconds
> 
> (All numbers are 548.exchange2_r reference run times.)
> 

+----------+-------+--------------+--+--+
| Compiler | Flags | diff GCC 10  |  |  |
+----------+-------+--------------+--+--+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |      |  |  |
+----------+-------+--------------+--+--+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer | -44% |  |  |
+----------+-------+--------------+--+--+
| GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto | -36% |  |  |
+----------+-------+--------------+--+--+
| GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |  |  |
+----------+-------+--------------+--+--+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 | -22% |  |  |
+----------+-------+--------------+--+--+
| Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12% |  |  |
+----------+-------+--------------+--+--+

RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-09-08 Thread Martin Jambor
Hi,

On Fri, Aug 21 2020, Tamar Christina wrote:
>> 
>> Honza's changes have been motivated to a big extent as an enabler for IPA-CP
>> heuristics changes to actually speed up 548.exchange2_r.
>> 
>> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
>> weeks ago, this week it is 403, but with my WIP (and so far untested) patch
>> below it is just 276 seconds - faster than one built with GCC 8 which needs
>> 283 seconds.
>> 
>> I'll be interested in knowing if it also works this well on other 
>> architectures.
>> 

I have posted the new version of the patch series to the mailing list
yesterday and I have also pushed the branch to the FSF repo as
refs/users/jamborm/heads/ipa-context_and_exchange-200907

>
> Many thanks for working on this!
>
> I tried this on an AArch64 Neoverse-N1 machine and didn't see any difference.
> Do I need any flags for it to work? The patch was applied on top of 
> 656218ab982cc22b826227045826c92743143af1
>

I only have access to fairly old AMD (Seattle) Opteron 1100 which might
not support some interesting Aarch64 ISA extensions but I can measure a
significant speedup on it (everything with just -Ofast -march=native
-mtune=native, no non-default parameters, without LTO, without any
inlining options):

  GCC 10 branch:  915 seconds
  Master (rev. 995bb851ffe):  989 seconds
  My branch:  827 seconds

(All numbers are 548.exchange2_r reference run times.)

> And I tried 3 runs
> 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param 
> ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 
> -fno-inline-functions-called-once

This is the first time I saw -fno-inline-functions-called-once used in
this context.  This seems to indicate we are looking at another problem
that at least I have not known about yet.  Can you please upload
somewhere the inlining WPA dumps with and without the option?

Similarly, I do not need LTO for the speedup on x86_64.

The patches in the series should also remove the need for --param
ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 If you still need
them on my branch, could you please again provide me with (WPA, if with
LTO) ipa-cp dumps with and without them?


> 2) -mcpu=native -Ofast -fomit-frame-pointer -flto 
> -fno-inline-functions-called-once
> 3) -mcpu=native -Ofast -fomit-frame-pointer -flto
>
> The first one used to give us the best result; with this patch there is no
> difference between 1 and 2 (an 11% regression), and the 3rd one is about 15%
> on top of that.

OK, so the patch did help (but above you wrote it did not?) but not
enough to be as fast as some previous revision and on top of that
-fno-inline-functions-called-once further helps but again not enough?

If correct, this looks like we need to examine what goes wrong
specifically in the case of Neoverse-N1 though.

Thanks,

Martin



RE: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-21 Thread Tamar Christina
Hi Martin,

> Hi,
> 
> On Thu, Aug 20 2020, Richard Sandiford wrote:
> >>
> >>
> >> I really appreciate your detailed explanation.  BTW, my previous
> >> patch for PGO build on exchange2 took a similar method by setting
> >> each cloned node to 1/10th of the frequency several months ago :)
> >>
> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html
> >
> > Does it seem likely that we'll reach a resolution on this soon?
> > I take the point that the patch that introduced the exchange
> > regression
> > [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
> > was just uncovering a latent issue, but still, this is a large
> > regression in an important benchmark to be carrying around.  For those
> > of us doing regular benchmark CI, the longer the performance trough
> > gets, the harder it is to spot other unrelated regressions in the “properly
> optimised” code.
> >
> > So if we don't have a specific plan for fixing the regression soon, I
> > think we should consider reverting the patch until we have something
> > that avoids the exchange regression, even though the patch wasn't
> > technically wrong.
> 
> Honza's changes have been motivated to a big extent as an enabler for IPA-CP
> heuristics changes to actually speed up 548.exchange2_r.
> 
> On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
> weeks ago, this week it is 403, but with my WIP (and so far untested) patch
> below it is just 276 seconds - faster than one built with GCC 8 which needs
> 283 seconds.
> 
> I'll be interested in knowing if it also works this well on other 
> architectures.
> 

Many thanks for working on this!

I tried this on an AArch64 Neoverse-N1 machine and didn't see any difference.
Do I need any flags for it to work? The patch was applied on top of 
656218ab982cc22b826227045826c92743143af1

And I tried 3 runs
1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param 
ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 
-fno-inline-functions-called-once
2) -mcpu=native -Ofast -fomit-frame-pointer -flto 
-fno-inline-functions-called-once
3) -mcpu=native -Ofast -fomit-frame-pointer -flto

The first one used to give us the best result; with this patch there is no
difference between 1 and 2 (an 11% regression), and the 3rd one is about 15% on
top of that.

If there's anything I can do to help just let me know.

Cheers,
Tamar

> The patch still needs a bit of a cleanup.  The change of the default value of
> ipa-cp-unit-growth needs to be done only for small compilation units (like
> inlining does).  I should experiment if the value of
> param_ipa_cp_loop_hint_bonus should be changed or not.  And last but not
> least, I also want to clean-up the interfaces between ipa-fnsummary.c and
> ipa-cp.c a bit.  I am working on all of this and hope to finish the patch set 
> in a
> few (working) days.
> 
> The bottom line is that there is a plan to address this regression.
> 
> Martin
> 
> 
> 
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index e4910a04ffa..0d44310503a
> 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -3190,11 +3190,23 @@ devirtualization_time_bonus (struct
> cgraph_node *node,
>  /* Return time bonus incurred because of HINTS.  */
> 
>  static int
> -hint_time_bonus (cgraph_node *node, ipa_hints hints)
> +hint_time_bonus (cgraph_node *node, ipa_hints hints, sreal
> known_iter_freq,
> +  sreal known_strides_freq)
>  {
>int result = 0;
> -  if (hints & (INLINE_HINT_loop_iterations | INLINE_HINT_loop_stride))
> -result += opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus);
> +  sreal bonus_for_one = opt_for_fn (node->decl,
> + param_ipa_cp_loop_hint_bonus);
> +
> +  if (hints & INLINE_HINT_loop_iterations)
> +{
> +  /* FIXME: This should probably be way more nuanced.  */
> +  result += (known_iter_freq * bonus_for_one).to_int ();
> +}
> +  if (hints & INLINE_HINT_loop_stride)
> +{
> +  /* FIXME: And this as well.  */
> +  result += (known_strides_freq * bonus_for_one).to_int ();
> +}
> +
>return result;
>  }
> 
> @@ -3395,12 +3407,13 @@ perform_estimation_of_a_value (cgraph_node
> *node, vec<tree> known_csts,
>  int est_move_cost, ipcp_value_base *val)  {
>int size, time_benefit;
> -  sreal time, base_time;
> +  sreal time, base_time, known_iter_freq, known_strides_freq;
>ipa_hints hints;
> 
>estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts,
> >known_aggs, &size, &time_benefit,
> > -  &base_time, &hints);
> > +  &base_time, &hints, &known_iter_freq,
> > +  &known_strides_freq);
>base_time -= time;
>if (base_time > 65535)
>  base_time = 65535;
> @@ -3414,7 +3427,7 @@ perform_estimation_of_a_value (cgraph_node
> *node, vec known_csts,
>  time_benefit = base_time.to_int ()
>+ devirtualization_time_bonus (node, known_csts, known_contexts,
>

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-20 Thread Martin Jambor
Hi,

On Thu, Aug 20 2020, Richard Sandiford wrote:
>>
>>
>> I really appreciate your detailed explanation.  BTW, my previous patch
>> for PGO build on exchange2 took a similar method by setting each cloned
>> node to 1/10th of the frequency several months ago :)
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html
>
> Does it seem likely that we'll reach a resolution on this soon?
> I take the point that the patch that introduced the exchange regression
> [https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
> was just uncovering a latent issue, but still, this is a large regression
> in an important benchmark to be carrying around.  For those of us doing
> regular benchmark CI, the longer the performance trough gets, the harder
> it is to spot other unrelated regressions in the “properly optimised” code.
>
> So if we don't have a specific plan for fixing the regression soon,
> I think we should consider reverting the patch until we have something
> that avoids the exchange regression, even though the patch wasn't
> technically wrong.

Honza's changes have been motivated to a big extent as an enabler for
IPA-CP heuristics changes to actually speed up 548.exchange2_r.

On my AMD Zen2 machine, the run-time of exchange2 was 358 seconds two
weeks ago, this week it is 403, but with my WIP (and so far untested)
patch below it is just 276 seconds - faster than one built with GCC 8
which needs 283 seconds.

I'll be interested in knowing if it also works this well on other
architectures.

The patch still needs a bit of a cleanup.  The change of the default
value of ipa-cp-unit-growth needs to be done only for small compilation
units (like inlining does).  I should experiment if the value of
param_ipa_cp_loop_hint_bonus should be changed or not.  And last but not
least, I also want to clean-up the interfaces between ipa-fnsummary.c
and ipa-cp.c a bit.  I am working on all of this and hope to finish the
patch set in a few (working) days.

The bottom line is that there is a plan to address this regression.

Martin



diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index e4910a04ffa..0d44310503a 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -3190,11 +3190,23 @@ devirtualization_time_bonus (struct cgraph_node *node,
 /* Return time bonus incurred because of HINTS.  */
 
 static int
-hint_time_bonus (cgraph_node *node, ipa_hints hints)
+hint_time_bonus (cgraph_node *node, ipa_hints hints, sreal known_iter_freq,
+                 sreal known_strides_freq)
 {
   int result = 0;
-  if (hints & (INLINE_HINT_loop_iterations | INLINE_HINT_loop_stride))
-    result += opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus);
+  sreal bonus_for_one = opt_for_fn (node->decl, param_ipa_cp_loop_hint_bonus);
+
+  if (hints & INLINE_HINT_loop_iterations)
+    {
+      /* FIXME: This should probably be way more nuanced.  */
+      result += (known_iter_freq * bonus_for_one).to_int ();
+    }
+  if (hints & INLINE_HINT_loop_stride)
+    {
+      /* FIXME: And this as well.  */
+      result += (known_strides_freq * bonus_for_one).to_int ();
+    }
+
   return result;
 }
 
@@ -3395,12 +3407,13 @@ perform_estimation_of_a_value (cgraph_node *node, vec<tree> known_csts,
                                int est_move_cost, ipcp_value_base *val)
 {
   int size, time_benefit;
-  sreal time, base_time;
+  sreal time, base_time, known_iter_freq, known_strides_freq;
   ipa_hints hints;
 
   estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts,
                                      known_aggs, &size, &time,
-                                     &base_time, &hints);
+                                     &base_time, &hints, &known_iter_freq,
+                                     &known_strides_freq);
   base_time -= time;
   if (base_time > 65535)
     base_time = 65535;
@@ -3414,7 +3427,7 @@ perform_estimation_of_a_value (cgraph_node *node, vec<tree> known_csts,
   time_benefit = base_time.to_int ()
     + devirtualization_time_bonus (node, known_csts, known_contexts,
                                    known_aggs)
-    + hint_time_bonus (node, hints)
+    + hint_time_bonus (node, hints, known_iter_freq, known_strides_freq)
     + removable_params_cost + est_move_cost;
 
   gcc_checking_assert (size >=0);
@@ -3476,7 +3489,7 @@ estimate_local_effects (struct cgraph_node *node)
 {
   struct caller_statistics stats;
   ipa_hints hints;
-  sreal time, base_time;
+  sreal time, base_time, known_iter_freq, known_strides_freq;
   int size;
 
   init_caller_stats ();
@@ -3484,9 +3497,11 @@ estimate_local_effects (struct cgraph_node *node)
                                               false);
   estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts,
                                      known_aggs, &size, &time,
-                                     &base_time, &hints);
+                                     &base_time, &hints, &known_iter_freq,
+                                     &known_strides_freq);
   time -= devirt_bonus;
-  time -= 

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-20 Thread Richard Sandiford
Xionghu, thanks for working on fixing the exchange regression.

luoxhu via Gcc-patches  writes:
> On 2020/8/13 20:52, Jan Hubicka wrote:
>>> Since there are no other callers outside of these specialized nodes, the
>>> guessed profile count should be the same?  Perf tool shows that even though
>>> each specialized node is called only once, none of them takes the same time
>>> for each call:
>>>
>>>40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_digits_2.constprop.4
>>>16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_digits_2.constprop.3
>>>10.91%  exchange2_gcc.o  libgfortran.so.5.0.0 [.] 
>>> _gfortran_mminloc0_4_i4
>>> 5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_digits_2.constprop.6
>>> 4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __logic_MOD_new_solver
>>> 3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_digits_2.constprop.5
>>> 1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_digits_2.constprop.7
>>> 0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_brute.constprop.0
>>> 0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_digits_2.constprop.2
>>> 0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_digits_2.constprop.1
>>> 0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_covered.constprop.0
>>> 0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_reflected.constprop.0
>>> 0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>>> __brute_force_MOD_brute.constprop.1
>>>
>>>
>>> digits_2.constprop.4 & digits_2.constprop.3 take most of the execution
>>> time, so profile count and frequency seem not very helpful for this case?
>> Yep, you can not really determine the time spent on each of recursion
>> levels from the recursive edge probability since you can not assume it
>> to be the same on each level of recursion (here we know it is 0 at level
>> 10 and it is slowly dropping as the recursion increases because there
>> are fewer possibilities to fill up the partly filled sudoku:).
>> Especially you can not assume it after the ipa-cp did the work to figure
>> out that there is recursion depth counter that affect the function.
>> 
>> Thinking of the ipa-cp profile updating changes, I did not consider safe
>> the approach of adding extra weight to the recursive call. The problem
>> is that the recursive call frequency estimate, when coming from static
>> profile estimation, is likely completely off, just like the static profile
>> estimator can not be used to predict realistically the number of
>> iterations of loops.  However even with profile feedback this is hard to
>> interpret.
>> 
>> I was wondering if we can not simply detect this scenario and avoid
>> using the recursive edge probability in the profile updating.
>> We could simply scale by the number of copies.
>> This means that if we produce 10 clones, we could just set each clone to
>> 1/10th of the frequency.  This implies that the hottest spot will not be
>> underestimated exponentially much as can happen now, but just 10 times
>> at worst, which is probably still OK. We play similar games in loop
>> optimizers: static profile estimators expect every loop to trip around 3
>> times so unroller/vectorizer/etc would make no sense on them.
>> 
>> By scaling down according to number of copies we will keep the
>> frequencies of calls to function called from clones "sane".  We will
>> still run into problems when we inline one clone to another (since its
>> profile will get scaled by the recursive edge frequency), but perhaps
>> that can be special-cased in the profiler, or we can make ipa-cp adjust the
>> recursive edges to get frequency close to 1 as a hack.
>> 
>> This is not pretty, but at least it looks safer to me than the original
>> profile updating patch adding extra weight to the frequency.
>
>
> Really appreciate your detailed explanation.  BTW, my previous patch
> for the PGO build on exchange2 took a similar approach by setting each cloned
> node to 1/10th of the frequency several months ago :)
>
> https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html

Does it seem likely that we'll reach a resolution on this soon?
I take the point that the patch that introduced the exchange regression
[https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551757.html]
was just uncovering a latent issue, but still, this is a large regression
in an important benchmark to be carrying around.  For those of us doing
regular benchmark CI, the longer the performance trough gets, the harder
it is to spot other unrelated regressions in the “properly optimised” code.

So if we don't have a specific plan for fixing the regression soon,
I think we should consider reverting the patch until we have something
that avoids the exchange regression, even 

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-14 Thread luoxhu via Gcc-patches
Hi,

On 2020/8/13 20:52, Jan Hubicka wrote:
>> Since there are no other callers outside of these specialized nodes, the
>> guessed profile count should be the same?  Perf tool shows that even though
>> each specialized node is called only once, none of them takes the same time
>> for each call:
>>
>>40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_digits_2.constprop.4
>>16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_digits_2.constprop.3
>>10.91%  exchange2_gcc.o  libgfortran.so.5.0.0 [.] 
>> _gfortran_mminloc0_4_i4
>> 5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_digits_2.constprop.6
>> 4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __logic_MOD_new_solver
>> 3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_digits_2.constprop.5
>> 1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_digits_2.constprop.7
>> 0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_brute.constprop.0
>> 0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_digits_2.constprop.2
>> 0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_digits_2.constprop.1
>> 0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_covered.constprop.0
>> 0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_reflected.constprop.0
>> 0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
>> __brute_force_MOD_brute.constprop.1
>>
>>
>> digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
>> so profile count and frequency seem not very helpful for this case?
> Yep, you can not really determine the time spent on each of recursion
> levels from the recursive edge probability since you can not assume it
> to be the same on each level of recursion (here we know it is 0 at level
> 10 and it is slowly dropping as the recursion increases because there
> are fewer possibilities to fill up the partly filled sudoku:).
> Especially you can not assume it after the ipa-cp did the work to figure
> out that there is recursion depth counter that affect the function.
> 
> Thinking of the ipa-cp profile updating changes, I did not consider safe
> the approach of adding extra weight to the recursive call. The problem
> is that the recursive call frequency estimate, when coming from static
> profile estimation, is likely completely off, just like the static profile
> estimator can not be used to predict realistically the number of
> iterations of loops.  However even with profile feedback this is hard to
> interpret.
> 
> I was wondering if we can not simply detect this scenario and avoid
> using the recursive edge probability in the profile updating.
> We could simply scale by the number of copies.
> This means that if we produce 10 clones, we could just set each clone to
> 1/10th of the frequency.  This implies that the hottest spot will not be
> underestimated exponentially much as can happen now, but just 10 times
> at worst, which is probably still OK. We play similar games in loop
> optimizers: static profile estimators expect every loop to trip around 3
> times so unroller/vectorizer/etc would make no sense on them.
> 
> By scaling down according to number of copies we will keep the
> frequencies of calls to function called from clones "sane".  We will
> still run into problems when we inline one clone to another (since its
> profile will get scaled by the recursive edge frequency), but perhaps
> that can be special-cased in the profiler, or we can make ipa-cp adjust the
> recursive edges to get frequency close to 1 as a hack.
> 
> This is not pretty, but at least it looks safer to me than the original
> profile updating patch adding extra weight to the frequency.


Really appreciate for your detailed explanation.  BTW, My previous patch
for PGO build on exchange2 takes this similar method by setting each cloned
node to 1/10th of the frequency several month agao :)

https://gcc.gnu.org/pipermail/gcc-patches/2020-June/546926.html

Xionghu


Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-13 Thread Jan Hubicka
> 
> Thanks for the information :)  Tamar replied that there is another
> regression *on exchange2 of 11%*.  I've also rebased my code and confirmed
> it really gets even slower than before (reverting the patch pulls the
> performance back)...

Yep, we need to figure out how to fix this - the summary of the IRA issue is
interesting.  I was aware of it, but never really looked into what
IRA does wrong.

Basically what happened historically is that when exchange2 was newly
added to spec we looked at it with Martin and noticed the issue with the
loop nest being predicted very aggressively toward the innermost loop,
which led the loop optimizer to do funny things on loops that really
must not iterate too many times, since we know that the frequency of the
recursive call is strictly less than 1.

We spent some time on tuning the inliner for low trip count loops and also
added the LOOP_GUARD_WITH_RECURSION heuristic that was meant to reduce the
probability of entering the loop which contains the recursive call - this
should be a pattern in all similar backtrack-like algorithms. 
The conditions terminating the walk should be likely or the program
would never finish.
> 
> > 
> > Now if ipa-cp decides to duplicate digits a few times we have a new
> > problem.  The tree of recursion is organized in a way that the depth is
> > bounded by 10 (which GCC does not know) and moreover most time is not
> > spent on very deep levels of recursion.
> > 
> > For that you have the patch which increases frequencies of recursively
> > cloned nodes, however it still seems to me a very specific hack for
> > exchange: I do not see how to guess where most of time is spent.
> > Even for very regular trees, by master theorem, it depends on very
> > little differences in the estimates of recursion frequency whether most
> > of time is spent on the top of tree, bottom or things are balanced.
> 
> The build is not PGO, so I am not clear on how the profile count will affect
> the ipa-cp and ipa-inline decisions. 

Even without PGO the counts are used.  predict.c first estimates the
branch probabilities and then these are propagated to estimated counts
of basic blocks, and this is still used throughout the compiler to drive
the optimization decisions (so ipa-inline computes the estimated runtime
effects of the inlining, all weighted by the counts, and similarly
does ipa-cp).

What we ended up with was a bug in the patch adding LOOP_GUARD_WITH_RECURSION
which resulted in the loop guards in exchange being predicted by both
LOOP_GUARD_WITH_RECURSION and LOOP_GUARD.
Since the first claims an 85% chance that the loop will not be entered and
the second 75%, the combined outcome got over 90% probability, and combining
10 conditions resulted in a very small frequency of the recursive edge.
This did help IRA to allocate sanely, but not for good reasons,
so we ended up with exchange improvements and did not notice the
bug (this is the one fixed by the patch above).

For some releases PGO performance is slower than non-PGO
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/options
which I think is a combination of IRA and bad decisions in some loop
opts.  The other problem is that the vectorizer tends to blow up the
register pressure too.

> Since there are no other callers outside of these specialized nodes, the
> guessed profile count should be the same?  Perf tool shows that even though
> each specialized node is called only once, none of them takes the same time
> for each call:
> 
>   40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_digits_2.constprop.4
>   16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_digits_2.constprop.3
>   10.91%  exchange2_gcc.o  libgfortran.so.5.0.0 [.] 
> _gfortran_mminloc0_4_i4
>5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_digits_2.constprop.6
>4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
>3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_digits_2.constprop.5
>1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_digits_2.constprop.7
>0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_brute.constprop.0
>0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_digits_2.constprop.2
>0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_digits_2.constprop.1
>0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_covered.constprop.0
>0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_reflected.constprop.0
>0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
> __brute_force_MOD_brute.constprop.1
> 
> 
> digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
> so profile count and frequency seem not very helpful for this case? 

Yep, you can not really determine the time spent on each of recursion
levels from the recursive edge probability since you can 

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-13 Thread Jan Hubicka
> Hi!
> 
> On Wed, Aug 12, 2020 at 09:03:35PM +0200, Richard Biener wrote:
> > On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka  wrote:
> > >> From: Xiong Hu Luo 
> > >> 523.xalancbmk_r +1.32%
> > >> 541.leela_r +1.51%
> > >> 548.exchange2_r +31.87%
> > >> 507.cactuBSSN_r +0.80%
> > >> 526.blender_r   +1.25%
> > >> 538.imagick_r   +1.82%
> 
> > >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> > >> index 0211f08964f..11903ac1960 100644
> > >> --- a/gcc/cgraph.h
> > >> +++ b/gcc/cgraph.h
> > >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
> > >>cgraph_node *c = callee->ultimate_alias_target ();
> > >>if (caller->inlined_to)
> > >>  return caller->inlined_to->decl == c->decl;
> > >> +  else if (caller->clone_of && c->clone_of)
> > >> +return caller->clone_of->decl == c->clone_of->decl;
> > >>else
> > >>  return caller->decl == c->decl;
> > >
> > >If you clone the function so it is no longer self recursive, it does
> > >not
> > >make much sense to lie to optimizers that the function is still
> > >recursive.
> 
> Like Richard says below (if I understand him right, sorry if not), the
> function still *is* recursive in its group of clones.

The test above is not a heuristic.  Its purpose is to determine when the
offline body will be eliminated after inlining the call.  This happens
when
 1) this is the last call to the function 
 2) the function is not used otherwise (exported or visible)
 3) the call is not self-recursive
In case of 3 the problem is that inlining will introduce a new call to the
function itself and the offline copy will still be needed.

Here we see a chain of clones calling
 clone1->clone2->clone3->clone4...->clone9->clone1
inlining the call clone2->clone3 will eliminate the offline copy of clone3 and
will reduce code size.

The inliner has an analysis part, which intends to model code size/time
accurately, and a heuristics part.  The patch makes the size/time analysis
produce wrong results, while this needs to be solved in the heuristics part.

Note that with a bit of optimization we should be able to eliminate the call
clone9->clone1 because it is dead (I believe we don't do that only
because the recursion limit is set 1 off).  This however does not make
inlining clone2->clone3 any better an idea.  The same is true if the user
wrote the clones explicitly.


> 
> > >The inlining would be harmful even if the programer did cloning by
> > >hand.
> > >I guess main problem is the extreme register pressure issue combining
> > >loop depth of 10 in caller with loop depth of 10 in callee just because
> > >the function is called once.
> > >
> > >The negative effect is most likely also due to wrong profile estimate
> > >which drives IRA to optimize wrong spot.  But I wonder if we simply
> > >don't want to teach inlining function called once to not construct
> > >large
> > >loop depths?  Something like do not inline if caller loop depth
> > >is over 3 or so?
> > 
> > I don't think that's good by itself (consider leaf functions and x86 xmm 
> > reg ABI across calls). Even with large loop depth abstraction penalty 
> > removal can make inlining worth it. For the testcase the recursiveness is 
> > what looks special (recursion from a deeper loop nest level). 
> 
> Yes, the loop stuff / register pressure issues might help for the
> exchange result, but what about the other five above?

I think such a large loop nest (20 if you combine caller+callee) rarely
happens in practice, so it may be a good heuristic.  I guess it depends
on how many exchange2-only hacks we want in the compiler.  We should aim
to make changes that help other codebases too.

The ipa-cp recursive cloning machinery IMO has good potential to help
other code which has a fixed cap on recursion depth.
Consider the following code:

int array[1000];
static int
fact (int a)
{
  int ret = 1;  /* initialize so fact (1) is well defined */
  if (a > 1)
    ret = fact (a - 1) * a;
  for (int i = 0; i < a; i++)
    array[i + (a - 1) * (a - 1) / 2] = 0;
  return ret;
}

int
test (void)
{
  return fact (10);
}

This computes the factorial and at the same time clears memory (just to
avoid a tail call happening and also to make the function a bit bigger).

Here it is a win to produce all 10 clones, inline them together,
precompute the factorial and combine the memsets.

We have a few problems visible on this testcase:
 1) ipa-cp needs a reduction of its evaluation thresholds
 2) the estimated profile is bogus (since we estimate the recursion with
 probability 1/3, after 10 levels of recursion the inner factorial
 gets probability almost 0)
 3) the inliner will still happily flatten the function, not noticing that
 it has to be cold.  This is because it does that pairwise and never sees
 the combined probability.

Honza


Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-13 Thread luoxhu via Gcc-patches
Hi,

On 2020/8/13 01:53, Jan Hubicka wrote:
> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
> 
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do an amazing job on guessing the profile since it is quite
> atypical. First observation is that the callback frequency needs to be
> less than 1, otherwise the program never terminates; however with 10
> nested loops one needs to predict every loop to iterate just a few times
> and the conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing a regression in exchange since the bad profile turned out to
> disable some harmful vectorization) and I also now added a cap to the
> self recursive frequency so things do not get mispropagated by ipa-cp.

Thanks for the information :)  Tamar replied that there is another
regression *on exchange2 of 11%*.  I've also rebased my code and confirmed
it really gets even slower than before (reverting the patch pulls the
performance back)...

> 
> Now if ipa-cp decides to duplicate digits a few times we have a new
> problem.  The tree of recursion is organized in a way that the depth is
> bounded by 10 (which GCC does not know) and moreover most time is not
> spent on very deep levels of recursion.
> 
> For that you have the patch which increases frequencies of recursively
> cloned nodes, however it still seems to me a very specific hack for
> exchange: I do not see how to guess where most of time is spent.
> Even for very regular trees, by master theorem, it depends on very
> little differences in the estimates of recursion frequency whether most
> of time is spent on the top of tree, bottom or things are balanced.

The build is not PGO, so I am not clear on how the profile count will affect
the ipa-cp and ipa-inline decisions. 
Since there are no other callers outside of these specialized nodes, the
guessed profile count should be the same?  Perf tool shows that even though
each specialized node is called only once, none of them takes the same time
for each call:

  40.65%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_digits_2.constprop.4
  16.31%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_digits_2.constprop.3
  10.91%  exchange2_gcc.o  libgfortran.so.5.0.0 [.] _gfortran_mminloc0_4_i4
   5.41%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_digits_2.constprop.6
   4.68%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] __logic_MOD_new_solver
   3.76%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_digits_2.constprop.5
   1.07%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_digits_2.constprop.7
   0.84%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_brute.constprop.0
   0.47%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_digits_2.constprop.2
   0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_digits_2.constprop.1
   0.24%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_covered.constprop.0
   0.11%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_reflected.constprop.0
   0.00%  exchange2_gcc.o  exchange2_gcc.orig.slow  [.] 
__brute_force_MOD_brute.constprop.1


digits_2.constprop.4 & digits_2.constprop.3 take most of the execution time,
so profile count and frequency seem not very helpful for this case? 

> 
> With algorithms doing backtracking, like exchange, the likelihood of
> recursion reduces with deeper recursion levels, but we do not know how
> quickly and what the level is.


> 
>> From: Xiong Hu Luo 
>>
>> For SPEC2017 exchange2, there is a large recursive function digits_2
>> (function size 1300) that generates specialized nodes digits_2.1 to
>> digits_2.8 with the added build options:
>>
>> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>>
>> The ipa-inline pass will consider inlining these nodes called only once, but
>> these large functions inlined too deeply will cause serious register spills
>> and performance degradation, as follows.
>>
>> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 
>> 2.7, 2.8)
>> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline 
>> digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 
>> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
>> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 
>> -> 2.6 -> 2.7 -> 2.8
>>
>> Performance diff:
>> inlineB is ~25% faster than inlineA;
>> inlineC is ~20% faster than inlineB;
>> inlineD is ~30% faster than inlineC.
>>
>> The master GCC code now generates inline sequence like inlineB, this patch
>> makes the ipa-inline pass behavior like inlineD by:
>>   1) The growth accumulation for recursive 

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-12 Thread Feng Xue OS via Gcc-patches
> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
> 
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do an amazing job on guessing the profile since it is quite
> atypical. First observation is that the callback frequency needs to be
> less than 1, otherwise the program never terminates; however with 10
> nested loops one needs to predict every loop to iterate just a few times
> and the conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing a regression in exchange since the bad profile turned out to
> disable some harmful vectorization) and I also now added a cap to the
> self recursive frequency so things do not get mispropagated by ipa-cp.

With the default setting of PRED_LOOP_GUARD_WITH_RECURSION, static profile
estimation for exchange2 is far from accurate: the hottest recursive function
is predicted as infrequent.  However, this low execution estimate works fine
with IRA.  I've tried to tweak the likelihood of the predictor, same as you;
performance degraded when the estimated profile increased.  This regression
is also found to be correlated with IRA, which produces many more register
spills than by default.  In the presence of deep loops and high register
pressure, IRA behaves more sensitively to profile estimation, and this
exhibits an unwanted property of the current IRA algorithm.  I've described
it in a tracker (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174).

Feng

> 
> Now if ipa-cp decides to duplicate digits a few times we have a new
> problem.  The tree of recursion is organized in a way that the depth is
> bounded by 10 (which GCC does not know) and moreover most time is not
> spent on very deep levels of recursion.
> 
> For that you have the patch which increases frequencies of recursively
> cloned nodes, however it still seems to me a very specific hack for
> exchange: I do not see how to guess where most of time is spent.
> Even for very regular trees, by master theorem, it depends on very
> little differences in the estimates of recursion frequency whether most
> of time is spent on the top of tree, bottom or things are balanced.
> 
> With algorithms doing backtracking, like exchange, the likelihood of
> recursion reduces with deeper recursion levels, but we do not know how
> quickly and what the level is.
> 
>> From: Xiong Hu Luo 
>> 
>>  For SPEC2017 exchange2, there is a large recursive function digits_2
>>  (function size 1300) that generates specialized nodes digits_2.1 to
>>  digits_2.8 with the added build options:
>> 
>>  --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> 
>>  The ipa-inline pass will consider inlining these nodes called only once, but
>>  these large functions inlined too deeply will cause serious register spills
>>  and performance degradation, as follows.
>> 
>>  inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 
>> 2.6, 2.7, 2.8)
>>  inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline 
>> digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>>  inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 
>> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
>>  inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 
>> 2.5 -> 2.6 -> 2.7 -> 2.8
>> 
>>  Performance diff:
>>  inlineB is ~25% faster than inlineA;
>>  inlineC is ~20% faster than inlineB;
>>  inlineD is ~30% faster than inlineC.
>> 
>>  The master GCC code now generates inline sequence like inlineB, this patch
>>  makes the ipa-inline pass behavior like inlineD by:
>>   1) The growth accumulation for recursive calls by adding the growth data
>>  to the edge when the edge's caller is inlined into another function, to
>>  avoid inlining too deeply;
>>   2) And if caller and callee are both specialized from the same node, the
>>  edge should also be considered a recursive edge.
>> 
>>  SPEC2017 testing shows GEOMEAN improves +2.75% in total (+0.56% without
>> exchange2).
>>  Any comments?  Thanks.
>> 
>>  523.xalancbmk_r +1.32%
>>  541.leela_r +1.51%
>>  548.exchange2_r +31.87%
>>  507.cactuBSSN_r +0.80%
>>  526.blender_r   +1.25%
>>  538.imagick_r   +1.82%
>> 
>>  gcc/ChangeLog:
>> 
>>  2020-08-12  Xionghu Luo  
>> 
>>* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>>callee and specialized from same node.
>>* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>>inlined_to growth to edge whose caller is inlined.
>>  ---
>>   gcc/cgraph.h  | 2 ++
>>   gcc/ipa-inline-analysis.c | 3 +++
>>   2 files changed, 5 insertions(+)
>> 
>>  diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>>  index 0211f08964f..11903ac1960 100644
>>  --- a/gcc/cgraph.h
>>  +++ b/gcc/cgraph.h
>>  @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>> cgraph_node *c = 

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-12 Thread Segher Boessenkool
Hi!

On Wed, Aug 12, 2020 at 09:03:35PM +0200, Richard Biener wrote:
> On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka  wrote:
> >> From: Xiong Hu Luo 
> >> 523.xalancbmk_r +1.32%
> >> 541.leela_r +1.51%
> >> 548.exchange2_r +31.87%
> >> 507.cactuBSSN_r +0.80%
> >> 526.blender_r   +1.25%
> >> 538.imagick_r   +1.82%

> >> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> >> index 0211f08964f..11903ac1960 100644
> >> --- a/gcc/cgraph.h
> >> +++ b/gcc/cgraph.h
> >> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
> >>cgraph_node *c = callee->ultimate_alias_target ();
> >>if (caller->inlined_to)
> >>  return caller->inlined_to->decl == c->decl;
> >> +  else if (caller->clone_of && c->clone_of)
> >> +return caller->clone_of->decl == c->clone_of->decl;
> >>else
> >>  return caller->decl == c->decl;
> >
> >If you clone the function so it is no longer self recursive, it does
> >not
> >make much sense to lie to optimizers that the function is still
> >recursive.

Like Richard says below (if I understand him right, sorry if not), the
function still *is* recursive in its group of clones.

> >The inlining would be harmful even if the programmer did the cloning by
> >hand.
> >I guess the main problem is the extreme register pressure issue combining
> >loop depth of 10 in caller with loop depth of 10 in callee just because
> >the function is called once.
> >
> >The negative effect is most likely also due to a wrong profile estimate
> >which drives IRA to optimize the wrong spot.  But I wonder if we simply
> >don't want to teach inlining of functions called once not to construct
> >large
> >loop depths?  Something like do not inline if the caller loop depth
> >is over 3 or so?
> 
> I don't think that's good by itself (consider leaf functions and the x86 xmm
> reg ABI across calls). Even with large loop depth, abstraction penalty removal
> can make inlining worth it. For the testcase the recursiveness is what looks
> special (recursion from a deeper loop nest level).

Yes, the loop stuff / register pressure issues might help for the
exchange result, but what about the other five above?


Segher


Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-12 Thread Richard Biener via Gcc-patches
On August 12, 2020 7:53:07 PM GMT+02:00, Jan Hubicka  wrote:
>Hello,
>Martin and I spent some time looking into exchange2, and my
>understanding of the problem is the following:
>
>There is the self recursive function digits_2 with the property that it
>has 10 nested loops and calls itself from the innermost.
>Now we do not do an amazing job of guessing the profile since it is quite
>atypical. The first observation is that the callback frequency needs to be
>less than 1, otherwise the program never terminates; however with 10
>nested loops one needs to predict every loop to iterate just a few times
>and the conditionals guarding them as not very likely. For that we added
>PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
>(causing a regression in exchange since the bad profile turned out to
>disable some harmful vectorization), and I also now added a cap to the
>self-recursive frequency so things do not get mispropagated by ipa-cp.
>
>Now if ipa-cp decides to duplicate digits a few times we have a new
>problem.  The tree of recursion is organized in a way that the depth is
>bounded by 10 (which GCC does not know) and moreover most time is not
>spent on very deep levels of recursion.
>
>For that you have the patch which increases frequencies of recursively
>cloned nodes; however it still seems to me like a very specific hack for
>exchange: I do not see how to guess where most of the time is spent.
>Even for very regular trees, by the master theorem, it depends on very
>little differences in the estimates of recursion frequency whether most
>of the time is spent at the top of the tree, at the bottom, or whether
>things are balanced.
>
>With algorithms doing backtracking, like exchange, the likelihood of
>recursion reduces with deeper recursion level, but we do not know how
>quickly and what the level is.
>
>> From: Xiong Hu Luo 
>> 
>> For SPEC2017 exchange2, there is a large self-recursive function digits_2
>> (function size 1300) that generates specialized nodes digits_2.1 to
>> digits_2.8 with the added build options:
>> 
>> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> 
>> The ipa-inline pass will consider inlining these nodes since they are
>> called only once, but inlining these large functions too deeply causes
>> serious register spills and a performance drop, as follows.
>> 
>> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5
>> (inline 2.6, 2.7, 2.8)
>> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4
>> (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) ->
>> 2.4 -> 2.5 -> 2.6 (inline 2.7) -> 2.8
>> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4
>> -> 2.5 -> 2.6 -> 2.7 -> 2.8
>> 
>> Performance diff:
>> inlineB is ~25% faster than inlineA;
>> inlineC is ~20% faster than inlineB;
>> inlineD is ~30% faster than inlineC.
>> 
>> The master GCC code now generates an inline sequence like inlineB; this
>> patch makes the ipa-inline pass behave like inlineD by:
>>  1) Accumulating growth for recursive calls, adding the growth data
>> to the edge when the edge's caller is inlined into another function, to
>> avoid inlining too deeply;
>>  2) If caller and callee are both specialized from the same node, also
>> considering the edge a recursive edge.
>> 
>> SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56%
>> without exchange2).
>> Any comments?  Thanks.
>> 
>> 523.xalancbmk_r +1.32%
>> 541.leela_r +1.51%
>> 548.exchange2_r +31.87%
>> 507.cactuBSSN_r +0.80%
>> 526.blender_r   +1.25%
>> 538.imagick_r   +1.82%
>> 
>> gcc/ChangeLog:
>> 
>> 2020-08-12  Xionghu Luo  
>> 
>>  * cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>>  callee are specialized from the same node.
>>  * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>>  inlined_to growth to edge whose caller is inlined.
>> ---
>>  gcc/cgraph.h  | 2 ++
>>  gcc/ipa-inline-analysis.c | 3 +++
>>  2 files changed, 5 insertions(+)
>> 
>> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>> index 0211f08964f..11903ac1960 100644
>> --- a/gcc/cgraph.h
>> +++ b/gcc/cgraph.h
>> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>>cgraph_node *c = callee->ultimate_alias_target ();
>>if (caller->inlined_to)
>>  return caller->inlined_to->decl == c->decl;
>> +  else if (caller->clone_of && c->clone_of)
>> +return caller->clone_of->decl == c->clone_of->decl;
>>else
>>  return caller->decl == c->decl;
>
>If you clone the function so it is no longer self recursive, it does
>not
>make much sense to lie to optimizers that the function is still
>recursive.
>
>The inlining would be harmful even if the programmer did the cloning by
>hand.
>I guess the main problem is the extreme register pressure issue combining
>loop depth of 10 in caller with loop depth of 10 in callee just because
>the function is called once.
>
>The negative effect is most likely also due to a wrong profile estimate
>which drives IRA to optimize the wrong spot.

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-12 Thread Jan Hubicka
Hello,
Martin and I spent some time looking into exchange2, and my
understanding of the problem is the following:

There is the self recursive function digits_2 with the property that it
has 10 nested loops and calls itself from the innermost.
Now we do not do an amazing job of guessing the profile since it is quite
atypical. The first observation is that the callback frequency needs to be
less than 1, otherwise the program never terminates; however with 10
nested loops one needs to predict every loop to iterate just a few times
and the conditionals guarding them as not very likely. For that we added
PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
(causing a regression in exchange since the bad profile turned out to
disable some harmful vectorization), and I also now added a cap to the
self-recursive frequency so things do not get mispropagated by ipa-cp.

Now if ipa-cp decides to duplicate digits a few times we have a new
problem.  The tree of recursion is organized in a way that the depth is
bounded by 10 (which GCC does not know) and moreover most time is not
spent on very deep levels of recursion.

For that you have the patch which increases frequencies of recursively
cloned nodes; however it still seems to me like a very specific hack for
exchange: I do not see how to guess where most of the time is spent.
Even for very regular trees, by the master theorem, it depends on very
little differences in the estimates of recursion frequency whether most
of the time is spent at the top of the tree, at the bottom, or whether
things are balanced.

With algorithms doing backtracking, like exchange, the likelihood of
recursion reduces with deeper recursion level, but we do not know how
quickly and what the level is.

> From: Xiong Hu Luo 
> 
> For SPEC2017 exchange2, there is a large self-recursive function digits_2
> (function size 1300) that generates specialized nodes digits_2.1 to
> digits_2.8 with the added build options:
> 
> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> 
> The ipa-inline pass will consider inlining these nodes since they are called
> only once, but inlining these large functions too deeply causes serious
> register spills and a performance drop, as follows.
> 
> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 
> 2.7, 2.8)
> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline 
> digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 
> 2.5 -> 2.6 (inline 2.7) -> 2.8
> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 
> -> 2.6 -> 2.7 -> 2.8
> 
> Performance diff:
> inlineB is ~25% faster than inlineA;
> inlineC is ~20% faster than inlineB;
> inlineD is ~30% faster than inlineC.
> 
> The master GCC code now generates an inline sequence like inlineB; this patch
> makes the ipa-inline pass behave like inlineD by:
>  1) Accumulating growth for recursive calls, adding the growth data
> to the edge when the edge's caller is inlined into another function, to avoid
> inlining too deeply;
>  2) If caller and callee are both specialized from the same node, also
> considering the edge a recursive edge.
> 
> SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56% 
> without exchange2).
> Any comments?  Thanks.
> 
> 523.xalancbmk_r +1.32%
> 541.leela_r +1.51%
> 548.exchange2_r +31.87%
> 507.cactuBSSN_r +0.80%
> 526.blender_r   +1.25%
> 538.imagick_r   +1.82%
> 
> gcc/ChangeLog:
> 
> 2020-08-12  Xionghu Luo  
> 
>   * cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>   callee are specialized from the same node.
>   * ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>   inlined_to growth to edge whose caller is inlined.
> ---
>  gcc/cgraph.h  | 2 ++
>  gcc/ipa-inline-analysis.c | 3 +++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> index 0211f08964f..11903ac1960 100644
> --- a/gcc/cgraph.h
> +++ b/gcc/cgraph.h
> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>cgraph_node *c = callee->ultimate_alias_target ();
>if (caller->inlined_to)
>  return caller->inlined_to->decl == c->decl;
> +  else if (caller->clone_of && c->clone_of)
> +return caller->clone_of->decl == c->clone_of->decl;
>else
>  return caller->decl == c->decl;

If you clone the function so it is no longer self recursive, it does not
make much sense to lie to optimizers that the function is still
recursive.

The inlining would be harmful even if the programmer did the cloning by hand.
I guess the main problem is the extreme register pressure issue combining
loop depth of 10 in caller with loop depth of 10 in callee just because
the function is called once.

The negative effect is most likely also due to a wrong profile estimate
which drives IRA to optimize the wrong spot.  But I wonder if we simply
don't want to teach inlining of functions called once not to construct large
loop depths?  Something like do not inline if the caller loop depth
is over 3 or so?