Tamar Christina <[email protected]> wrote on Sat, Sep 12, 2020 at 1:39 AM:
> Hi Martin,
>
> >
> > can you please confirm that the difference between these two is all due to
> > the last option, -fno-inline-functions-called-once? Is LTO necessary? I.e.,
> > can you run the benchmark also built with the branch compiler and
> > -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once?
> >
>
> Done, see below.
>
> > >
> > > +--------+------------------------------------------------+------+
> > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer -flto | -24% |
> > > +--------+------------------------------------------------+------+
> > > | Branch | -mcpu=native -Ofast -fomit-frame-pointer       | -26% |
> > > +--------+------------------------------------------------+------+
> >
> > >
> > > (Hopefully the table shows up correct)
> >
> > it does show OK for me, thanks.
> >
> > >
> > > It looks like your patch definitely does improve the basic cases. So
> > > there's not much difference between LTO and non-LTO anymore and it's
> > > much better than GCC 10. However it still contains the regression
> > > introduced by Honza's changes.
> >
> > I assume these are rates, not times, so negative means bad. But do I
> > understand it correctly that you're comparing against GCC 10 with the two
> > parameters set to rather special values? Because your table seems to
> > indicate that even for you, the branch is faster than GCC 10 with just
> > -mcpu=native -Ofast -fomit-frame-pointer.
>
> Yes, these are indeed rates, and indeed I am comparing against the same
> options we used to get the fastest rates before, which are the two
> parameters and the inline flag.
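
To make the sign convention concrete, here is a minimal Python sketch. It assumes the "diff GCC 10" figures are computed as rate / baseline_rate - 1 and that rate is inversely proportional to run time; the thread does not state the exact formula, so both are assumptions, and the function names are purely illustrative.

```python
def rate_diff(rate: float, baseline_rate: float) -> float:
    """Relative rate difference versus the baseline (assumed formula).

    A negative value means the build is slower than the baseline.
    """
    return rate / baseline_rate - 1.0


def runtime_ratio(diff: float) -> float:
    """Run-time ratio implied by a relative rate difference.

    Assumes rate is inversely proportional to run time.
    """
    return 1.0 / (1.0 + diff)


if __name__ == "__main__":
    # -12% is the figure reported in this thread for the branch
    # built with the full option set.
    print(f"{runtime_ratio(-0.12):.3f}")  # prints 1.136
```

Under these assumptions a -12% rate difference corresponds to a run time roughly 1.14 times the baseline, so the time-based regression is slightly larger than the rate-based figure suggests.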
>
> >
> > So is the problem that the best obtainable run-time, even with obscure
> > options, from the branch is slower than the best obtainable run-time from
> > GCC 10?
> >
>
> Yeah that's the problem, when we compare the two we're still behind.
>
> I've done the additional two runs:
>
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Compiler | Flags                                                                                                                                          | diff GCC 10 |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once |             |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                       | -44%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | GCC 10   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                 | -36%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | GCC 11   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80                                   | -22%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -12%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto                                                                                                 | -24%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer                                                                                                       | -26%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -flto -fno-inline-functions-called-once                                                               | -12%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
> | Branch   | -mcpu=native -Ofast -fomit-frame-pointer -fno-inline-functions-called-once                                                                     | -11%        |
> +----------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
>
> And this confirms that LTO indeed isn't needed, and that the branch
> without any extra options is much better than GCC 10 was without any
> extra options.
>
> It also confirms that the only remaining difference comes from
> -fno-inline-functions-called-once.
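
One way to check that reading of the table is to average the six Branch rows by whether each option is present. The Python sketch below does that with the percentages quoted above; the tuple layout and the helper name mean_diff are only illustrative, and averaging percentages is a rough summary, not a statistical test.

```python
from statistics import mean

# Branch rows from the table above:
# (-flto, --param pair, -fno-inline-functions-called-once, diff vs GCC 10 in %)
BRANCH_ROWS = [
    (True,  True,  False, -22),
    (True,  True,  True,  -12),
    (True,  False, False, -24),
    (False, False, False, -26),
    (True,  False, True,  -12),
    (False, False, True,  -11),
]

OPTIONS = ("-flto", "--param ipa-cp-*", "-fno-inline-functions-called-once")


def mean_diff(rows, column, enabled):
    """Mean diff over the rows where option `column` is on (or off)."""
    return mean(diff for *flags, diff in rows if flags[column] == enabled)


if __name__ == "__main__":
    for column, name in enumerate(OPTIONS):
        with_opt = mean_diff(BRANCH_ROWS, column, True)
        without_opt = mean_diff(BRANCH_ROWS, column, False)
        print(f"{name}: with {with_opt:.1f}%, without {without_opt:.1f}%")
```

With these numbers, toggling -flto or the two --param values moves the mean by about one percentage point, while -fno-inline-functions-called-once moves it from -24% to roughly -11.7%.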
>
> > >
> > >> > And I tried 3 runs
> > >> > 1) -mcpu=native -Ofast -fomit-frame-pointer -flto --param
> > >> > ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
> > >> > -fno-inline-functions-called-once
> > >>
> > >> This is the first time I saw -fno-inline-functions-called-once used
> > >> in this context. This seems to indicate we are looking at another
> > >> problem that at least I have not known about yet. Can you please
> > >> upload somewhere the inlining WPA dumps with and without the option?
> > >
> > > We used it to cover up for the register allocation issue where
> > > inlining some large functions would cause massive spilling. Looks like
> > > it still has an effect now but even with it we're still seeing the 12%
> > > regression.
> > >
> > > Which option is this? -fdump-ipa-cgraph?
> >
> > -fdump-ipa-inline-details and -fdump-ipa-cp-details.
>
> I've kicked off the CI runs and will get you the dumps on Monday.
>
> Cheers,
> Tamar
>
> >
> > It would be nice if the slowdown was all due to the inliner. But the
> > predictor changes might of course have quite an impact also on other
> > optimizations.
> >
> > Martin
>
>
Hi Martin,
Thanks for your work. In case you are interested, here is the exchange2
result for your branch on our Cascadelake server (based on Tamar's test and
our regular configuration):
| Compiler | Flags                                                                                                                                     | single-core diff GCC10 | multi-core diff GCC10 |
|----------|-------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------|
| GCC10.1  | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -                      | -                     |
| GCC10.1  | -march=native -Ofast -funroll-loops                                                                                                       | -32%                   | -37%                  |
| GCC10.1  | -march=native -Ofast -funroll-loops -flto                                                                                                 | -32%                   | -37%                  |
| GCC11    | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80                                   | -39%                   | -28%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80 -fno-inline-functions-called-once | -20%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto                                                                                                 | -39%                   | -28%                  |
| Branch   | -march=native -Ofast -funroll-loops                                                                                                       | -41%                   | -29%                  |
| Branch   | -march=native -Ofast -funroll-loops -flto -fno-inline-functions-called-once                                                               | -19%                   | -13%                  |
| Branch   | -march=native -Ofast -funroll-loops -fno-inline-functions-called-once                                                                     | -20%                   | -13%                  |
For the multi-core tests, the branch performs better than GCC10.1 when no
extra IPA options are used, but it still shows a 12% regression compared with
GCC10's best score. For single-core, there is also a gap of about 7% between
the branch and GCC10.1.
Regards,
Hongyu Wang