https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91981

            Bug ID: 91981
           Summary: Speed degradation because of inlining a register
                    clobbering function
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: antoshkka at gmail dot com
  Target Milestone: ---

Consider the following example, which is a simplified version of
boost::container::small_vector:


#define MAKE_INLINING_BAD 1

struct vector {
    int* data_;
    int* capacity_;
    int* size_;

    void push_back(int v) {
        if (capacity_ > size_) {
            *size_ = v;
            ++size_;
        } else {
            reallocate_and_push(v);
        }
    }

    void reallocate_and_push(int v)
#if MAKE_INLINING_BAD
    {
        // Just some code that clobbers many registers.
        // You may skip reading it
        const auto old_cap = capacity_ - data_;
        const auto old_size = size_ - data_;
        const auto new_cap = old_cap * 2 + 1;

        auto new_data_1 = new int[new_cap];
        auto new_data = new_data_1;
        for (int* old_data = data_; old_data != size_; ++old_data, ++new_data) {
            *new_data = *old_data;
        }

        delete[] data_;
        data_ = new_data_1;
        size_ = new_data_1 + old_size;
        capacity_ = new_data_1 + new_cap;

        *size_ = v;
        ++size_;
    }
#else
    ;
#endif
};

void bad_inlining(vector& v) {
    v.push_back(42);
}
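
The assembly listings below were obtained with an invocation along these lines
(a sketch; the file name is illustrative, and the exact output depends on the
GCC version; see also the Godbolt link at the end):

g++ -std=c++17 -O2 -S -masm=intel -o - bad_inlining.cpp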


With `#define MAKE_INLINING_BAD 0` the generated code is quite good:

bad_inlining(vector&):
  mov rax, QWORD PTR [rdi+16]
  cmp QWORD PTR [rdi+8], rax
  jbe .L2
  mov DWORD PTR [rax], 42
  add rax, 4
  mov QWORD PTR [rdi+16], rax
  ret
.L2:
  mov esi, 42
  jmp vector::reallocate_and_push(int)

However, with `#define MAKE_INLINING_BAD 1` the compiler decides to inline
`reallocate_and_push`, which clobbers many callee-saved registers. As a result
the compiler saves those registers on the stack before the cmp+jbe, i.e. even
on the fast path:

bad_inlining(vector&):
  push r13 ; don't need those for the `(capacity_ > size_)` case
  push r12     ; likewise
  push rbp     ; likewise
  push rbx     ; likewise
  mov rbx, rdi ; likewise
  sub rsp, 8   ; likewise
  mov rdx, QWORD PTR [rdi+8]
  mov rax, QWORD PTR [rdi+16]
  cmp rdx, rax
  jbe .L2
  mov DWORD PTR [rax], 42
  add rax, 4
  mov QWORD PTR [rdi+16], rax
  add rsp, 8 ; don't need those for the `(capacity_ > size_)` case
  pop rbx     ; likewise
  pop rbp     ; likewise
  pop r12     ; likewise
  pop r13     ; likewise
  ret
.L2: 
  ; vector::reallocate_and_push(int) implementation goes here

This greatly degrades the performance of the first (fast) branch: more than a
3x slowdown in real code.
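
For what it's worth, the degradation can be worked around at the source level
by keeping the register-heavy slow path out of line, e.g. with GCC's
`noinline`/`cold` attributes. A sketch only (`vector2` and `good_inlining` are
illustrative names, and this is a workaround, not a fix for the missed
optimization itself):

// Workaround sketch, not part of the reported test case.
struct vector2 {
    int* data_;
    int* capacity_;
    int* size_;

    void push_back(int v) {
        if (capacity_ > size_) {
            *size_ = v;   // fast path: no callee-saved registers needed
            ++size_;
        } else {
            reallocate_and_push(v);
        }
    }

    // Keeping the slow path out of line removes the push/pop sequence
    // from push_back's prologue.
    __attribute__((noinline, cold))
    void reallocate_and_push(int v) {
        const auto old_size = size_ - data_;
        const auto new_cap = (capacity_ - data_) * 2 + 1;

        auto new_data = new int[new_cap];
        for (long i = 0; i != old_size; ++i) {
            new_data[i] = data_[i];
        }

        delete[] data_;
        data_ = new_data;
        size_ = new_data + old_size;
        capacity_ = new_data + new_cap;

        *size_ = v;
        ++size_;
    }
};

void good_inlining(vector2& v) {
    v.push_back(42);
}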


A possible fix would be to sink all the push/pop operations into the inlined
`reallocate_and_push` path:

bad_inlining(vector&):
  mov rax, QWORD PTR [rdi+16]
  cmp QWORD PTR [rdi+8], rax
  jbe .L2
  mov DWORD PTR [rax], 42
  add rax, 4
  mov QWORD PTR [rdi+16], rax
  ret
.L2: 
  push r13
  push r12
  push rbp
  push rbx
  mov rbx, rdi
  sub rsp, 8
  ; vector::reallocate_and_push(int) implementation goes here
  add rsp, 8
  pop rbx
  pop rbp
  pop r12
  pop r13
  ret

Godbolt playground: https://godbolt.org/z/oDutOd
