[Bug gas/33943] gas: .prefalign directive for body-size-dependent function alignment

maskray at sourceware dot org Thu, 12 Mar 2026 10:26:46 -0700

https://sourceware.org/bugzilla/show_bug.cgi?id=33943


--- Comment #2 from Fangrui Song <maskray at sourceware dot org> ---
(In reply to Jan Beulich from comment #1)
> (In reply to Fangrui Song from comment #0)
> > Compilers emit `.p2align 4` (or similar) before functions to align them to a
> > preferred boundary (e.g. 16 bytes on x86-64). This is good for large
> > functions but wasteful for small ones: a 3-byte function padded to a 16-byte
> > boundary wastes up to 15 bytes — 500% overhead.
> 
> Hmm, for me 16 - 3 = 13.

If the current location is 1 mod 16, .p2align 4 advances it to the next
location that is 0 mod 16, which requires 15 bytes of padding.
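
To make the arithmetic concrete, here is a small sketch of the padding a
.p2align-style directive inserts. The function name is only illustrative and
is not part of gas.

```python
def p2align_padding(location: int, log2_align: int) -> int:
    """Bytes of padding needed to advance `location` to the next
    multiple of 2**log2_align (zero if it is already aligned)."""
    align = 1 << log2_align
    return -location % align


# Worst case for a 16-byte boundary: the location is 1 mod 16.
print(p2align_padding(1, 4))   # 15
# The "16 - 3 = 13" case would require the previous function to end at 3 mod 16.
print(p2align_padding(3, 4))   # 13
# Already aligned: no padding at all.
print(p2align_padding(32, 4))  # 0
```

With a 3-byte body, 15 bytes of padding is 15 / 3 = 500% overhead, which is
where the figure in comment #0 comes from.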

> As you're considering compilers, wouldn't such very small functions
> generally best be inlined? And even if not, as that's not possible when e.g.
> a library has to provide an implementation, won't compilers' size estimates
> for extremely small functions generally be correct?

Small functions exist in practice more often than one might expect.
For example, C++ virtual tables store function addresses, so many small
functions must be emitted out of line even when they could otherwise be inlined.

The original author of the .prefalign directive in LLVM also designed LLVM's
LTO-based Control-flow Integrity. CFI can result in many more small functions.

To enforce integrity, an indirect call like `call *fptr` is replaced by a
protective stub. This stub performs two primary actions:

* verifies that the target address is a member of the valid target set and
converts that address into a small index.
* dereferences the specific jump table entry and jumps to the target.

This produces many stubs, appearing either as simple relays (foo_jt: jmp foo)
or as fully inlined versions of the target function (foo_jt:
<inlined-body-of-foo>).

foo_jt can't simply be foo because foo's address is not guaranteed to fall
inside the jump table's contiguous region. For example, foo can appear in
multiple indirect call sites.

This proliferation of jump stubs was the primary motivation for the author to
introduce the ld.lld --branch-to-branch optimization:
https://github.com/llvm/llvm-project/pull/138366

> > .prefalign <pref_align>, <end_sym>, nop
> > .prefalign <pref_align>, <end_sym>, <fill_byte>
> > 
> > - `pref_align`: the preferred (maximum) alignment, must be a power of 2
> 
> Why not make it .p2align-like, requiring the power of 2 to be specified?

I'm fine with

.prefalign <log2_align>, <end_sym>, nop

to save a logarithm in the assembler and be consistent with the popular
.p2align directive.
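
For illustration, the two spellings differ only by a logarithm. The sketch
below uses a hypothetical helper name and also shows the power-of-2 check that
a byte-valued operand would need and a log2 operand makes unnecessary.

```python
def log2_of_alignment(byte_align: int) -> int:
    """Convert a byte alignment (e.g. 16) to the .p2align-style operand (4).
    Rejects values that are not a positive power of 2."""
    if byte_align <= 0 or byte_align & (byte_align - 1):
        raise ValueError("alignment must be a positive power of 2")
    return byte_align.bit_length() - 1


print(log2_of_alignment(16))       # 4
print(1 << log2_of_alignment(16))  # 16, the reverse conversion is a shift
```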

> 
> > - `end_sym`: a symbol marking the end of the code body
> > - Third operand: `nop` for target-appropriate (variable-size) NOP fill, or
> > an integer byte value `[0, 255]`
> 
> This looks x86-centric. The padding in code may want to be something else
> than NOP, yet that can't be specified by <fill_byte> if insn size /
> granularity is larger than a byte. This would need to be a fill pattern of
> (at least) insn granularity size.

The byte fill is for x86 int3 (0xcc) and for non-code uses (almost always 0).

We can require the fill to be written as hex digits:

.prefalign 4, end, 00
.prefalign 4, end, cc

Then if multi-byte fills are ever needed,

.prefalign 4, end, 0001
.prefalign 4, end, 00010203
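
A rough sketch of how such a hex-digit operand could be turned into padding
bytes. Both helper names are hypothetical, and repeating the pattern and
truncating the last copy is an assumption: nothing above specifies what happens
when the padding is not a multiple of the pattern length.

```python
def parse_fill(operand: str) -> bytes:
    """Interpret an operand such as "cc" or "00010203" as a byte pattern."""
    if not operand or len(operand) % 2:
        raise ValueError("fill must be a non-empty, even number of hex digits")
    return bytes.fromhex(operand)


def make_padding(pattern: bytes, length: int) -> bytes:
    """Repeat `pattern` to cover `length` bytes (ASSUMPTION: truncate the tail)."""
    repeats = -(-length // len(pattern))  # ceiling division
    return (pattern * repeats)[:length]


print(make_padding(parse_fill("cc"), 3).hex())    # cccccc
print(make_padding(parse_fill("0001"), 5).hex())  # 0001000100
```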

> > To enforce a minimum alignment independently, users emit both `.p2align` and
> > `.prefalign`.
> 
> I consider the need to use two directives as problematic. What if they're
> not sitting back to back?

The two are composable: .p2align guarantees a floor, and .prefalign adds
size-proportional alignment on top.
Merging them into one directive would either duplicate .p2align's functionality
or complicate the semantics.
The separation mirrors how recent LLVM models alignment: align(N) (minimum,
mandatory) vs prefalign(N) (preferred, size-dependent).
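
The composition described above can be sketched as follows. The size rule used
here (smallest power of 2 that covers the body, capped at the preferred
alignment) is an assumption made for illustration; this comment does not spell
out the exact rule, and all names are hypothetical.

```python
def align_up(location: int, align: int) -> int:
    """Round `location` up to a multiple of `align` (a power of 2)."""
    return (location + align - 1) & -align


def place_function(location: int, body_size: int,
                   min_log2: int, pref_log2: int) -> int:
    """Return the start address after a .p2align floor plus a .prefalign."""
    floor = 1 << min_log2  # mandatory minimum from .p2align
    pref = 1 << pref_log2  # preferred maximum from .prefalign
    # ASSUMED rule: smallest power of 2 >= body_size, never above `pref`.
    size_based = 1
    while size_based < body_size and size_based < pref:
        size_based <<= 1
    return align_up(location, max(floor, size_based))


# Location 0x11, floor of 2 bytes, preferred alignment of 16 bytes.
print(hex(place_function(0x11, 3, 1, 4)))   # 0x14: 3-byte body, 4-byte boundary
print(hex(place_function(0x11, 40, 1, 4)))  # 0x20: large body, full 16 bytes
print(hex(place_function(0x11, 1, 1, 4)))   # 0x12: only the floor applies
```

Under this assumed rule the 3-byte function costs 3 bytes of padding instead
of 15, while large functions still get the full preferred alignment.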

> And the - how is this going to work for targets aiming at mainly link-time
> relaxation (RISC-V for example)?

.prefalign's forward reference to end_sym is resolved within the assembler's
own iterative relaxation, not deferred to the linker.
If the linker later shrinks the instructions (e.g. auipc+jalr -> jal), the
assembler-inserted padding remains, even if the smaller body would have
qualified for a smaller alignment.
I think RISC-V and LoongArch derive slightly less benefit from the .prefalign
directive.

