https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95264

--- Comment #6 from Freddie Witherden <freddie at witherden dot org> ---
(In reply to Richard Biener from comment #3)
> So with the [[gnu::flatten]] attributes removed -O1 needs 80 seconds to
> compile and about 3GB of memory, -O2 needs around 2 minutes (same memory),
> -O3 is the same as -O2.
> 
> Maybe instead of [[gnu::flatten]] you want to bump --param inline-unit-growth
> or --param large-function-growth more moderately in case you can measure an
> effect on runtime.
> 
> Note multiple [[gnu::flatten]] can really exponentially grow program size
> since it is not apparent which functions might be used from another
> translation unit until you can use -fwhole-program (single CU program)
> or -flto (but there [[gnu::flatten]] is applied too early to avoid such
> growth - something we might want to fix).  Placing things not used from outside
> in anonymous namespaces might help.

The [[gnu::flatten]] was added to bring GCC's performance for T = double on a
par with Clang's.  (We don't care about performance with T = bfloat, as it is
only used as a final polishing pass.)  I can understand why GCC does not want
to inline it for T = bfloat, which is a complex type, but for T = double the
function is basically just a sequence of movs that populate an array.

As the function is of the form

for (int i = 0; i < N; i++) // N = template arg
    for (int j = 0; j < p[i]; j++) // runtime trip count
        foo(i, ...); // static polymorphism

with foo being a large switch-case on its first argument, the expectation was
that the compiler would inline foo, unroll the outer loop, and then prune the
dead cases, leaving something similar to

for (int j = 0; j < p[0]; j++)
    foo(0, ...); // inline i = 0 case
for (int j = 0; j < p[1]; j++)
    foo(1, ...); // inline i = 1 case
// ...
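
For reference, one way to spell out that expanded form explicitly, rather than
relying on the unroller, is to make the outer index a template argument so that
every call to foo sees a compile-time constant.  The following is only a
minimal sketch under assumptions: the body of foo, the helper names inner,
outer_impl and outer, and the out parameter are hypothetical stand-ins and not
the code from the report, it needs C++17 for the fold expression, and the
compiler is still free to decline to inline foo.

```cpp
#include <utility>

// Hypothetical stand-in for the real kernel: a switch on the first argument.
template <typename T>
inline void foo(int i, int j, T* out)
{
    switch (i)
    {
    case 0: out[0] += T(1) * (j + 1); break;
    case 1: out[1] += T(2) * (j + 1); break;
    case 2: out[2] += T(3) * (j + 1); break;
    default: break;
    }
}

// Inner loop for one fixed value I of the outer index.  I is a constant
// here, so once foo is inlined only one switch case remains live.
template <int I, typename T>
inline void inner(const int* p, T* out)
{
    for (int j = 0; j < p[I]; j++) // runtime trip count
        foo<T>(I, j, out);
}

// Expands to inner<0>(p, out), inner<1>(p, out), ... in index order.
template <typename T, int... Is>
inline void outer_impl(const int* p, T* out, std::integer_sequence<int, Is...>)
{
    (inner<Is>(p, out), ...); // C++17 fold over the comma operator
}

// Replaces the outer "for (int i = 0; i < N; i++)" loop.
template <int N, typename T>
void outer(const int* p, T* out)
{
    outer_impl(p, out, std::make_integer_sequence<int, N>{});
}
```

Calling outer<3>(p, out) then instantiates three separate inner loops, one per
value of i.  Whether this recovers the performance seen with [[gnu::flatten]]
in the reported case has not been measured here.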
