http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49035
           Summary: Avoid setting up stack frame for short, hot code paths
           Product: gcc
           Version: 4.5.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: middle-end
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: scov...@gmail.com

I often find myself writing functions of the following form:

void foo () {
    if (common_case) {
        /* do a little work and return */
    }
    /* uncommon case: do a lot of work, call functions, etc. */
}

The resulting assembly code always sets up a stack frame in the function
prologue, even though the function usually executes as a leaf using few (or
zero) of the callee-save registers and stack slots it saves.

Here's an example which is only slightly contrived:

=== rfe.cpp ============
struct link {
    link* prev;
    long go_slow;
    void frob(link* parent, link* grandparent);
};

link* foo(link* list) {
    link* prev = list->prev;
    while (__builtin_expect(prev->go_slow, 0)) {
        link* pprev = __sync_lock_test_and_set(&prev->prev, 0);
        pprev->frob(prev, list);
        prev = pprev;
    }
    return prev;
}
=== rfe.cpp ============

Compiling the above with `x86_64-unknown-linux-gnu-g++-4.5.2 -O3 -S' yields

_Z3fooP4link:
.LFB0:
        movq    %rbx, -24(%rsp)
        movq    %rbp, -16(%rsp)
        movq    %rdi, %rbx
        movq    %r12, -8(%rsp)
        subq    $24, %rsp
        movq    (%rdi), %rax
        cmpq    $0, 8(%rax)
        jne     .L8
.L2:
        movq    (%rsp), %rbx
        movq    8(%rsp), %rbp
        movq    16(%rsp), %r12
        addq    $24, %rsp
        ret
.L8:
        xorl    %r12d, %r12d
.L6:
        movq    %r12, %rbp
        xchgq   (%rax), %rbp
        movq    %rbx, %rdx
        movq    %rax, %rsi
        movq    %rbp, %rdi
        call    _ZN4link4frobEPS_S0_
        cmpq    $0, 8(%rbp)
        jne     .L4
        movq    %rbp, %rax
        jmp     .L2
.L4:
        movq    %rbp, %rax
        jmp     .L6

Ideally, it would look like this instead:

_Z3fooP4link:
.LFB0:
        ;; *** hot path executes as leaf ***
        movq    (%rdi), %rax
        cmpq    $0, 8(%rax)
        jne     .L8
        ret
.L8:
        ;; *** set up stack frame ***
        movq    %rbx, -24(%rsp)
        movq    %rbp, -16(%rsp)
        movq    %rdi, %rbx
        movq    %r12, -8(%rsp)
        subq    $24, %rsp
        ;; ***
        xorl    %r12d, %r12d
.L6:
        movq    %r12, %rbp
        xchgq   (%rax), %rbp
        movq    %rbx, %rdx
        movq    %rax, %rsi
        movq    %rbp, %rdi
        call    _ZN4link4frobEPS_S0_
        cmpq    $0, 8(%rbp)
        jne     .L4
        ;; *** tear down stack frame ***
        movq    %rbp, %rax
        movq    (%rsp), %rbx
        movq    8(%rsp), %rbp
        movq    16(%rsp), %r12
        addq    $24, %rsp
        ;; ***
        ret
.L4:
        movq    %rbp, %rax
        jmp     .L6

The effect can sometimes be simulated by hand: make foo an inline function
that contains only the fast path plus a call to a (non-inlined) slow-path
function. However, the whims of function inlining often conspire against
this, even when callers are able to inline foo (e.g. foo is not a library
function).

There's probably some overlap with partial inlining here: the ideal case
essentially splits the slow path off into its own function (reached via a
tail call); presumably partial inlining would inline the fast path while
having all callers jump to the same copy of the slow path function. However,
the optimization is arguably useful even if foo is never inlined at all.

Thoughts?
Ryan
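For reference, the manual workaround described in the report (an inlinable fast path plus an out-of-line slow path) might look roughly like the sketch below. The name foo_slow, the do/while restructuring of the loop, and the empty stub body for link::frob are assumptions made for illustration and are not part of the report. noinline and cold are GCC function attributes; the code that any particular GCC version generates for this is not verified here.

```cpp
// Sketch of a hand-written hot/cold split of foo() from rfe.cpp.
// Relies on GCC extensions (__builtin_expect, __sync builtins, attributes).
struct link {
    link* prev;
    long go_slow;
    void frob(link* parent, link* grandparent);
};

// Hypothetical stub so the sketch links; the report only declares frob().
void link::frob(link*, link*) {}

// Slow path (hypothetical name): kept out of line and marked cold, so the
// register saves it needs do not have to appear in foo() itself.
__attribute__((noinline, cold))
link* foo_slow(link* list, link* prev) {
    // Entered only when prev->go_slow is already known to be nonzero,
    // so do/while matches the original while loop.
    do {
        link* pprev = __sync_lock_test_and_set(&prev->prev,
                                               static_cast<link*>(0));
        pprev->frob(prev, list);
        prev = pprev;
    } while (prev->go_slow);
    return prev;
}

// Fast path: one load, one compare, one branch. No calls other than the
// tail call into the slow path.
inline link* foo(link* list) {
    link* prev = list->prev;
    if (__builtin_expect(prev->go_slow != 0, 0))
        return foo_slow(list, prev);
    return prev;
}
```

With this split, foo should need no callee-save registers of its own, and the frame setup should end up in foo_slow. Whether callers actually inline foo is still up to the inliner, which is the weakness of this approach noted in the report.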