https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930

            Bug ID: 105930
           Summary: Excessive stack spill generation on 32-bit x86
           Product: gcc
           Version: 12.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: torva...@linux-foundation.org
  Target Milestone: ---

Created attachment 53121
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53121&action=edit
Test-case extracted from the generic blake2b kernel code

Gcc-12 seems to generate a huge number of stack spills on this blake2b
test-case, to the point where it overflows the allowable kernel stack on 32-bit
x86.

This crypto thing has two 128-byte buffers, so a stack frame a bit larger than
256 bytes is expected when the dataset doesn't fit in the register set.

Just as an example, on this code, clang-14.0.0 generates a stack frame that is
296 bytes.

In contrast, gcc-12.1.1 generates a stack frame that is almost an order of
magnitude(!) larger, at 2620 bytes.

The trivial Makefile I used for this test-case is

    # The kernel cannot just randomly use FP/MMX/AVX
    CFLAGS := -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx
    CFLAGS += -m32
    CFLAGS += -O2

    test:
        gcc $(CFLAGS) -Wall -S blake2b.c
        grep "sub.*%[er]sp" blake2b.s

to easily test different flags and the end result, but as can be seen from
above, it really doesn't need any special flags except the ones that disable
MMX/AVX code generation.

And the generated code looks perfectly regular, except for the fact that it
uses almost 3kB of stack space.

Note that "-m32" is required to trigger this - the 64-bit case does much
better, presumably because it has more registers and thus needs fewer spills.
It gets worse with some added debug flags we use in the kernel, but not that
kind of "order of magnitude" worse.
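
For readers without the attachment at hand, here is a rough sketch of why
this kind of code is so register-hungry with -m32. It is NOT the attached
test-case: it is the BLAKE2b "G" mixing step as described in RFC 7693, and
the names rotr64 and blake2b_g are made up for illustration. The state is
16 64-bit words (128 bytes, presumably one of the two buffers mentioned
above), and with -m32 every 64-bit value needs a pair of 32-bit registers,
while 32-bit x86 only has eight general-purpose registers including %esp.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by c bits; only valid for 0 < c < 64. */
static inline uint64_t rotr64(uint64_t w, unsigned c)
{
    return (w >> c) | (w << (64 - c));
}

/*
 * One BLAKE2b G mixing step (per RFC 7693) on the 16-word state v.
 * a, b, c, d select four state words; x and y are two message words.
 * Hypothetical helper name, not taken from the kernel source.
 */
static inline void blake2b_g(uint64_t v[16], int a, int b, int c, int d,
                             uint64_t x, uint64_t y)
{
    v[a] = v[a] + v[b] + x;
    v[d] = rotr64(v[d] ^ v[a], 32);
    v[c] = v[c] + v[d];
    v[b] = rotr64(v[b] ^ v[c], 24);
    v[a] = v[a] + v[b] + y;
    v[d] = rotr64(v[d] ^ v[a], 16);
    v[c] = v[c] + v[d];
    v[b] = rotr64(v[b] ^ v[c], 63);
}
```

A full round applies this step eight times with different index sets, so
once the rounds are unrolled nearly the whole state is live at once. Some
spilling is unavoidable on 32-bit x86; the question in this report is only
why gcc-12 needs so much more stack for it than clang does.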

Using -O1 or -Os makes no real difference.

This is presumably due to some newly triggered optimization in gcc-12, but I
can't even begin to guess at what we'd need to disable (or enable) to avoid
this horrendous stack growth. Some very aggressive instruction scheduling thing
that spreads out all the calculations and always wants to spill-and-reload the
subexpressions that it CSE'd? I dunno.

Pls advise. The excessive stack literally causes build failures due to us using
-Werror=frame-larger-than= to make sure stack use remains sanely bounded. The
kernel stack is a rather limited resource.
