https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123082
Bug ID: 123082
Summary: missed optimization: large local array in recursive
function needlessly kept alive across recursive calls
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: skywave2023 at outlook dot com
Target Milestone: ---
Steps to reproduce
------------------
1. Compile the following test case with GCC at a high optimization level, e.g.:
g++ -std=gnu++23 -Ofast -S mgsort.cc -o mgsort.s
Tested with: g++ (GCC) 15.2.0 on x86_64-pc-linux-gnu.
2. Inspect the generated assembly for mgsort(int, int).
Test case (Compiler Explorer: https://godbolt.org/z/WYarKTTjW)
---------
#include <algorithm>
#include <cstring>
constexpr int N = 100000 + 1;
int a[N];
void mgsort(int l, int r) {
if (l == r) {
return;
}
int mid = (l + r) >> 1;
mgsort(l, mid);
mgsort(mid + 1, r);
int b[N];
std::merge(a + l, a + mid + 1, a + mid + 1, a + r + 1, b + l);
std::memcpy(a + l, b + l, sizeof(int) * (r - l + 1));
}
(Any similar testcase where a large local array is only used after both
recursive calls should be equivalent.)
Actual results
--------------
For non-leaf calls, GCC allocates the full `int b[N]` array in the prologue of
`mgsort` before making any recursive calls, and keeps this stack space reserved
until the function returns.
On x86_64, the prologue looks like (simplified):
cmp edi, esi
je .Lbase_case
push r15
push r14
push r13
push r12
push rbp
push rbx
sub rsp, 400040 # reserve ~400k for local array b[N]
call mgsort # recursive call (left half)
call mgsort # recursive call (right half)
...
lea r13, [rsp+16+rdx] # use rsp-based region as b + l
... # merge + memcpy
add rsp, 400040
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
So each recursive frame keeps its own full `b[N]` array on the stack even
though `b` is only accessed *after* both recursive calls return.
With N = 100000 + 1, sizeof(b) is about 400 kB. The recursion depth of this
merge-sort is O(log N) (about 17 for this N), so just this single local array
accounts for roughly depth * sizeof(b) ≈ 6–7 MB of stack usage, even though
there is never any need to have more than one `b` “live” at a time.
Expected results
----------------
Since the address of `b` never escapes and is only passed to non-recursive
library functions (`std::merge` and `memcpy`), and no pointer to `b` is passed
into the recursive calls of `mgsort`, there is no observable way for a
recursive call to access its caller's `b`.
This suggests that GCC could legally shrink the *actual lifetime of the
storage* for `b` so that it does not overlap the recursive calls, for example
by:
* moving the stack allocation for `b` (the `sub rsp, ...`) from the function
prologue to just before the merge+memcpy code, and
* freeing it (the `add rsp, ...`) immediately after `memcpy` returns, before
the function returns to its caller.
With such an optimization, at run time there would be at most one active
instance of `b` on the stack at any given moment: when a deeper recursive call
allocates its own `b`, the callers would not yet have allocated theirs (because
they only allocate `b` *after* their recursive calls have finished).
This would reduce the stack usage due to `b` from approximately `depth *
sizeof(b)` bytes to roughly `1 * sizeof(b) + O(depth)` bytes, without changing
the observable behavior of the program, as far as I can tell.
In other words, this looks like a missed opportunity to use lifetime analysis
to avoid keeping large local arrays alive across recursive calls.
Additional information
----------------------
* The same behavior is observed with both -O3 and -Ofast.
* The option -fstack-reuse=all is already the default and does not address this
issue, because it only controls reuse of stack slots for local variables and
temporaries *within the same frame*, not across different recursive
activations. So there seems to be no command-line option that enables this
optimization. (See the documentation of -fstack-reuse.)
* From the source-level code, the first use of `b` is strictly after both
recursive calls, and no pointer to `b` is stored anywhere that could be used by
a recursive call. So it seems to be safe to reorder the actual allocation of
storage for `b` as described above, under the “as-if” rule.
If this reasoning is correct, I would expect GCC to eventually be able to
shrink the lifetime of such large local arrays in recursive functions, or at
least consider this as an optimization opportunity.