https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123082

            Bug ID: 123082
           Summary: missed optimization: large local array in recursive
                    function needlessly kept alive across recursive calls
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: skywave2023 at outlook dot com
  Target Milestone: ---

Steps to reproduce
------------------

1. Compile the following test case with GCC at a high optimization level, e.g.:

   g++ -std=gnu++23 -Ofast -S mgsort.cc -o mgsort.s

   Tested with: g++ (GCC) 15.2.0 on x86_64-pc-linux-gnu.

2. Inspect the generated assembly for mgsort(int, int).

Test case (Compiler Explorer: https://godbolt.org/z/WYarKTTjW)
---------

#include <algorithm>
#include <cstring>

constexpr int N = 100000 + 1;

int a[N];

void mgsort(int l, int r) {
    if (l == r) {
        return;
    }
    int mid = (l + r) >> 1;
    mgsort(l, mid);
    mgsort(mid + 1, r);

    int b[N];
    std::merge(a + l, a + mid + 1, a + mid + 1, a + r + 1, b + l);
    std::memcpy(a + l, b + l, sizeof(int) * (r - l + 1));
}

(Any similar testcase where a large local array is only used after both
recursive calls should be equivalent.)

Actual results
--------------

For non-leaf calls, GCC allocates the full `int b[N]` array in the prologue of
`mgsort` before making any recursive calls, and keeps this stack space reserved
until the function returns.

On x86_64, the prologue looks like (simplified):

    cmp     edi, esi
    je      .Lbase_case
    push    r15
    push    r14
    push    r13
    push    r12
    push    rbp
    push    rbx
    sub     rsp, 400040          # reserve ~400k for local array b[N]
    call    mgsort               # recursive call (left half)
    call    mgsort               # recursive call (right half)
    ...
    lea     r13, [rsp+16+rdx]    # use rsp-based region as b + l
    ...                          # merge + memcpy
    add     rsp, 400040
    pop     rbx
    pop     rbp
    pop     r12
    pop     r13
    pop     r14
    pop     r15
    ret

So each recursive frame keeps its own full `b[N]` array on the stack even
though `b` is only accessed *after* both recursive calls return.

With N = 100000 + 1, sizeof(b) is about 400 kB.  The recursion depth of this
merge-sort is O(log N) (about 17 for this N), so just this single local array
accounts for roughly depth * sizeof(b) ≈ 6–7 MB of stack usage, even though
there is never any need to have more than one `b` “live” at a time.

Expected results
----------------

Since the address of `b` never escapes and is only passed to non-recursive
library functions (`std::merge` and `memcpy`), and no pointer to `b` is passed
into the recursive calls of `mgsort`, there is no observable way for a
recursive call to access its caller's `b`.

This suggests that GCC could legally shrink the *actual lifetime of the
storage* for `b` so that it does not overlap the recursive calls, for example
by:

  * moving the stack allocation for `b` (the `sub rsp, ...`) from the function
prologue to just before the merge+memcpy code, and
  * freeing it (the `add rsp, ...`) immediately after `memcpy` returns, before
the function returns to its caller.

With such an optimization, at run time there would be at most one active
instance of `b` on the stack at any given moment: when a deeper recursive call
allocates its own `b`, the callers would not yet have allocated theirs (because
they only allocate `b` *after* their recursive calls have finished).

This would reduce the stack usage due to `b` from approximately `depth *
sizeof(b)` bytes to roughly `1 * sizeof(b) + O(depth)` bytes, without changing
the observable behavior of the program, as far as I can tell.

In other words, this looks like a missed opportunity to use lifetime analysis
to avoid keeping large local arrays alive across recursive calls.

Additional information
----------------------

* The same behavior is observed with both -O3 and -Ofast.
* The option -fstack-reuse=all is already the default and does not address this
issue, because it only controls reuse of stack slots for local variables and
temporaries *within the same frame*, not across different recursive
activations.  So there seems to be no command-line option that enables this
optimization.  (See the documentation of -fstack-reuse.)
* From the source-level code, the first use of `b` is strictly after both
recursive calls, and no pointer to `b` is stored anywhere that could be used by
a recursive call.  So it seems to be safe to reorder the actual allocation of
storage for `b` as described above, under the “as-if” rule.

If this reasoning is correct, I would expect GCC to eventually be able to
shrink the lifetime of such large local arrays in recursive functions, or at
least consider this as an optimization opportunity.
  • [Bug tree-optimization/123082]... skywave2023 at outlook dot com via Gcc-bugs

Reply via email to