https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91981
Bug ID: 91981
Summary: Speed degradation because of inlining a register clobbering function
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the following example, which is a simplified version of boost::container::small_vector:

#define MAKE_INLINING_BAD 1

struct vector {
    int* data_;
    int* capacity_;
    int* size_;

    void push_back(int v) {
        if (capacity_ > size_) {
            *size_ = v;
            ++size_;
        } else {
            reallocate_and_push(v);
        }
    }

    void reallocate_and_push(int v)
#if MAKE_INLINING_BAD
    {
        // Just some code that clobbers many registers.
        // You may skip reading it
        const auto old_cap = capacity_ - data_;
        const auto old_size = capacity_ - size_;
        const auto new_cap = old_cap * 2 + 1;
        auto new_data_1 = new int[new_cap];
        auto new_data = new_data_1;
        for (int* old_data = data_; old_data != size_; ++old_data, ++new_data) {
            *new_data = *old_data;
        }
        delete[] data_;
        data_ = new_data_1;
        size_ = new_data_1 + old_size;
        capacity_ = new_data_1 + new_cap;
        *size_ = v;
        ++size_;
    }
#else
    ;
#endif
};

void bad_inlining(vector& v) {
    v.push_back(42);
}

With `#define MAKE_INLINING_BAD 0` the generated code is quite good:

bad_inlining(vector&):
        mov     rax, QWORD PTR [rdi+16]
        cmp     QWORD PTR [rdi+8], rax
        jbe     .L2
        mov     DWORD PTR [rax], 42
        add     rax, 4
        mov     QWORD PTR [rdi+16], rax
        ret
.L2:
        mov     esi, 42
        jmp     vector::reallocate_and_push(int)

However, with `#define MAKE_INLINING_BAD 1` the compiler decides to inline the `reallocate_and_push` function, which clobbers many registers.
So the compiler saves those registers on the stack before doing the cmp+jbe:

bad_inlining(vector&):
        push    r13              ; not needed for the `(capacity_ > size_)` case
        push    r12              ; likewise
        push    rbp              ; likewise
        push    rbx              ; likewise
        mov     rbx, rdi         ; likewise
        sub     rsp, 8           ; likewise
        mov     rdx, QWORD PTR [rdi+8]
        mov     rax, QWORD PTR [rdi+16]
        cmp     rdx, rax
        jbe     .L2
        mov     DWORD PTR [rax], 42
        add     rax, 4
        mov     QWORD PTR [rdi+16], rax
        add     rsp, 8           ; not needed for the `(capacity_ > size_)` case
        pop     rbx              ; likewise
        pop     rbp              ; likewise
        pop     r12              ; likewise
        pop     r13              ; likewise
        ret
.L2:
        ; vector::reallocate_and_push(int) implementation goes here

This greatly degrades the performance of the first branch (more than a 3x slowdown in real code). A possible fix would be to place all the push/pop operations next to the inlined `reallocate_and_push` code:

bad_inlining(vector&):
        mov     rax, QWORD PTR [rdi+16]
        cmp     QWORD PTR [rdi+8], rax
        jbe     .L2
        mov     DWORD PTR [rax], 42
        add     rax, 4
        mov     QWORD PTR [rdi+16], rax
        ret
.L2:
        push    r13
        push    r12
        push    rbp
        push    rbx
        mov     rbx, rdi
        sub     rsp, 8
        ; vector::reallocate_and_push(int) implementation goes here
        add     rsp, 8
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        ret

Godbolt playground: https://godbolt.org/z/oDutOd