https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124314
Bug ID: 124314
Summary: Feature request: Emit .prefalign for body-size-dependent function alignment
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: bootstrap
Assignee: unassigned at gcc dot gnu.org
Reporter: maskray at gcc dot gnu.org
Target Milestone: ---
GCC offers a family of performance-tuning options, -falign-* (-falign-functions,
-falign-labels, -falign-loops, -falign-jumps), that instruct the compiler to align
certain code to specific memory boundaries. These options can improve
performance by preventing instructions from straddling cache-line boundaries
(or instruction-fetch boundaries), which would otherwise cost an extra cache miss.
Inefficiency with small functions: aligning a small function can cost more
padding than the function itself occupies, which is rarely worth it. To address
this, GCC introduced -flimit-function-alignment in 2016. When the
-falign-functions max-skip (the padding budget, defaulting to N-1 for
-falign-functions=N) is greater than or equal to the function's code size, this
option caps the .p2align max-skip operand at the function size minus one, so the
NOP padding never exceeds the function body itself.
GCC computes the function code size in shorten_branches (final.cc), which
stores it in crtl->max_insn_address; assemble_start_function (varasm.cc) then
uses it to cap max_skip.
% echo 'int add1(int a){return a+1;}' | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 | grep p2align
.p2align 4
% echo 'int add1(int a){return a+1;}' | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 -flimit-function-alignment | grep p2align
.p2align 4,,3
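The capping rule can be modeled in a few lines. This is a sketch, not GCC source: the helper name limit_max_skip is hypothetical, and it assumes the add1 body above is 4 bytes (leal 1(%rdi), %eax; ret on x86-64), which is consistent with the .p2align 4,,3 output.

```c
#include <assert.h>

/* Hypothetical model of the -flimit-function-alignment cap (not GCC source).
   max_skip:  padding budget from -falign-functions=N (defaults to N - 1)
   body_size: estimated size of the function body in bytes
   Returns the max-skip operand that ends up in the .p2align directive. */
static int limit_max_skip(int max_skip, int body_size)
{
  assert(body_size > 0);
  /* If the budget could pad at least as many bytes as the body occupies,
     cap the padding so it stays strictly smaller than the body. */
  if (max_skip >= body_size)
    return body_size - 1;
  /* Otherwise the original budget is kept unchanged. */
  return max_skip;
}
```

With -falign-functions=16 the budget is 15, so limit_max_skip(15, 4) yields 3, matching .p2align 4,,3; a body larger than the budget leaves the budget untouched.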
This capping is an all-or-nothing decision. For example, a 3-byte function
compiled with -falign-functions=64 gets .p2align 6,,2:
- If the function happens to start within 2 bytes of a 64-byte boundary, it gets
full 64-byte alignment.
- Otherwise, it gets no alignment at all, not even 4-byte alignment, which
would still be beneficial.
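The all-or-nothing behavior follows from the .p2align semantics documented for GNU as: if reaching the boundary would need more than max-skip bytes, no alignment is done at all. The sketch below models that rule; the helper name p2align_addr is hypothetical, and the example assumes a 3-byte function compiled with -falign-functions=64, i.e. .p2align 6,,2.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical model of GNU as ".p2align log2_align,,max_skip":
   returns the address at which the next byte (the function start) lands. */
static unsigned long p2align_addr(unsigned long addr, unsigned log2_align,
                                  unsigned long max_skip)
{
  assert(log2_align < sizeof(unsigned long) * CHAR_BIT);
  unsigned long align = 1UL << log2_align;
  /* Bytes needed to reach the next multiple of align (0 if already aligned). */
  unsigned long pad = (align - (addr & (align - 1))) & (align - 1);
  /* All or nothing: pad fully, or not at all if the budget is exceeded. */
  return pad <= max_skip ? addr + pad : addr;
}
```

Under .p2align 6,,2, a function that would start at offset 62 or 63 moves to 64, while one at offset 10 stays at 10, not even 4-byte aligned.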
Proportional alignment would be better:
If the cache block size is 64 bytes and the goal is to minimize the number of
cache blocks a function spans, it suffices to align the function start to
min(64, bit_ceil(body_size)). For a body of at most 64 bytes, this is the
smallest power-of-two alignment that keeps the body within a single block: the
alignment divides 64 and is at least the body size. For example, a 12-byte
function aligned to 16 never crosses a 64-byte boundary, whereas with only 8-byte
alignment it could start at offset 56 and span two blocks.
A 3-byte function should get 4-byte alignment (std::bit_ceil(3) == 4).
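A minimal sketch of the proportional rule, assuming a 64-byte cache block. The helper names (bit_ceil_u32, proportional_align, blocks_spanned, never_crosses) are hypothetical; bit_ceil_u32 stands in for C++20 std::bit_ceil so the example stays in C, and blocks_spanned counts the blocks a body covers so the claim above can be checked.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= x (stand-in for C++20 std::bit_ceil). */
static uint32_t bit_ceil_u32(uint32_t x)
{
  assert(x >= 1 && x <= (UINT32_C(1) << 31));
  uint32_t p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

/* Proportional alignment: min(block, bit_ceil(body_size)); block is a power of two. */
static uint32_t proportional_align(uint32_t body_size, uint32_t block)
{
  uint32_t a = bit_ceil_u32(body_size);
  return a < block ? a : block;
}

/* Number of cache blocks covered by the byte range [start, start + size). */
static uint32_t blocks_spanned(uint32_t start, uint32_t size, uint32_t block)
{
  assert(size >= 1);
  return (start + size - 1) / block - start / block + 1;
}

/* For a body of at most 'block' bytes, check that every start aligned to
   proportional_align keeps the body within a single block. */
static int never_crosses(uint32_t size, uint32_t block)
{
  uint32_t a = proportional_align(size, block);
  for (uint32_t start = 0; start < 4 * block; start += a)
    if (blocks_spanned(start, size, block) != 1)
      return 0;
  return 1;
}
```

For instance, proportional_align(12, 64) is 16 and proportional_align(3, 64) is 4; blocks_spanned(56, 12, 64) is 2, while any 16-byte-aligned start gives 1.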
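Additionally:
* The compiler-side size estimate is imprecise: GCC emits textual assembly and
the assembler picks the final instruction encodings (e.g. x86 variable-length
encodings and branch relaxation), so crtl->max_insn_address is only an estimate.
* Target-specific: the max-skip mechanism only works on targets defining
ASM_OUTPUT_MAX_SKIP_ALIGN (x86, AArch64, ARM, PowerPC, RX, Visium). Other
targets fall back to an unconditional ASM_OUTPUT_ALIGN.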
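When the assembler supports .prefalign (the directive proposed in
https://sourceware.org/bugzilla/show_bug.cgi?id=33943 ), GCC could emit it
instead of (or in addition to) the current .p2align-with-max-skip approach.
Since the assembler knows the final encoded size of the function body, it can
choose the body-size-dependent alignment itself, sidestepping both the imprecise
compiler-side estimate and the all-or-nothing max-skip behavior.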
References
- Blog post with detailed analysis:
https://maskray.me/blog/2025-08-24-understanding-alignment-from-source-to-object-file
- LLVM RFC:
https://discourse.llvm.org/t/rfc-enhancing-function-alignment-attributes/88019
- GNU Assembler feature request:
https://sourceware.org/bugzilla/show_bug.cgi?id=33943