Slightly increasing the complexity of a function can disproportionately
increase the size and runtime of the generated code. This appears to be due to
the optimisers giving up on code blocks above a certain abstract size, and is
particularly severe on PPC and ARM, but is observable on ia32 and amd64 as
well.
This is a general problem which affects any large function, and has done since
at least gcc3 days - I first encountered it when trying to use Altivec
intrinsics. In some cases manually moving a function call *out* of a loop
results in 4x the runtime, which is the opposite of normal expectations.
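To make the loop-hoisting observation concrete, here is a hypothetical sketch of the transformation in question (these function names are illustrative, not the attached testcase): both variants are semantically identical, yet the report observes that the hoisted form can end up several times slower once the enclosing function is complex enough.

```c
/* Hypothetical helper; assume it is loop-invariant for a fixed gain. */
static float scale_factor(float gain)
{
    return gain * 0.5f;
}

/* Variant 1: the invariant call left inside the loop. */
float sum_in_loop(const float *v, int n, float gain)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += v[i] * scale_factor(gain);
    return acc;
}

/* Variant 2: the call manually hoisted out of the loop -- the form
 * that would normally be expected to run no slower. */
float sum_hoisted(const float *v, int n, float gain)
{
    float s = scale_factor(gain);
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += v[i] * s;
    return acc;
}
```

Conventional wisdom says the second form should be at least as fast; the bug is that the manual hoisting can instead quadruple the runtime.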
Attached is an example which demonstrates poor code generated after a long
series of inlining and dead code elimination stages. Demonstration is on
PPC32, but the same example suffices for ARMv7-A as well. An amd64 target
produces reasonable code for this example, but a fairly small complexity
increase causes a similar collapse.
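For readers without the attachment, the general shape of such a "tower of inlining" can be sketched as follows (a minimal hypothetical, not the attached example): small always-inline wrappers selecting behaviour via compile-time-constant flags, so that after inlining and dead code elimination the whole tower should fold down to a tight multiply-accumulate loop.

```c
/* Bottom of the tower: a trivial multiply-accumulate primitive. */
static inline float mac(float a, float b, float c)
{
    return a * b + c;
}

/* Middle of the tower: a wrapper whose branch is decided by a
 * compile-time constant, so one arm is dead after inlining. */
static inline float step(float a, float b, float c, int use_mac)
{
    if (use_mac)
        return mac(a, b, c);
    return a + b + c;   /* dead when use_mac is the constant 1 */
}

/* Top of the tower: the loop that should collapse to fmuls/fmadds. */
float kernel(const float *x, const float *y, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc = step(x[i], y[i], acc, 1);
    return acc;
}
```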
The output is two functions, one generated from the tower of inlining, and the
other (with a manual_ prefix) after the same optimisations were performed
manually. The quality of the latter is clearly better than the former, which
contains the following sequence in the inner loop:
All of the stw's in the above fragment are dead, except the "stw 4,24(1)" which
merely shuffles the value from f0 through two memory locations and back to f0.
The "li 4,0" also demonstrates very poor register allocation, since r4 already
contains zero before this fragment. In the "manual" variant, the fmuls is
immediately followed by the fmadds.
The same source file run through Clang on amd64 produces virtually identical
output for the two versions.
Summary: Optimisations fail above arbitrary level of complexity
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: jonathan dot morton at movial dot com
GCC build triplet: powerpc-linux-gnu
GCC host triplet: powerpc-linux-gnu
GCC target triplet: powerpc-linux-gnu