https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79189

            Bug ID: 79189
           Summary: Poor code generation when using stateless lambda
                    instead of normal function
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ldionne.2 at gmail dot com
  Target Milestone: ---

When doing an indirect call (through a function pointer), GCC generates
different code depending on whether the function pointer was obtained by
converting a stateless lambda or a normal function. The code generated for a
normal function is vastly superior, and in fact the call can sometimes be
elided.

Code:

  #if 1
    template <typename T>
    static auto const increment = [](void* self) { ++*static_cast<T*>(self); };
  #else
    template <typename T>
    void increment(void* self) { ++*static_cast<T*>(self); }
  #endif

  struct VTable { void (*increment)(void*); };
  template <typename T> static VTable const vtable{increment<T>};

  struct any_iterator {
    template <typename Iterator>
    explicit any_iterator(Iterator it)
      : vptr_{&vtable<Iterator>}, self_{new Iterator(it)}
    { }
    VTable const* vptr_;
    void* self_;
  };

  int main() {
    int input[100] = {0};
    any_iterator first{&input[0]};
    first.vptr_->increment(first.self_);
  }


Codegen with the lambda:

  increment<int*>::{lambda(void*)#1}::_FUN(void*):
          add     QWORD PTR [rdi], 4
          ret
  main:
          sub     rsp, 408
          xor     eax, eax
          mov     ecx, 50
          mov     rdi, rsp
          rep stosq
          mov     edi, 8
          call    operator new(unsigned long)
          mov     QWORD PTR [rax], rsp
          mov     rdi, rax
          call    [QWORD PTR vtable<int*>[rip]]
          xor     eax, eax
          add     rsp, 408
          ret
  _GLOBAL__sub_I_main:
          mov     QWORD PTR vtable<int*>[rip], OFFSET FLAT:increment<int*>::{lambda(void*)#1}::_FUN(void*)
          ret


Codegen with the function:

  main:
          sub     rsp, 8
          mov     edi, 8
          call    operator new(unsigned long)
          xor     eax, eax
          add     rsp, 8
          ret
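
One difference visible in the two listings above is that the lambda version
fills in vtable<int*> at run time from _GLOBAL__sub_I_main, so main has to load
the pointer before calling through it. A possible explanation, offered as an
assumption rather than a confirmed diagnosis, is that the conversion from a
closure to a function pointer is only a constant expression from C++17 on, so
under earlier standards the table cannot be constant-initialized. The sketch
below assumes -std=c++17 and declares both variable templates constexpr, which
forces constant initialization. advance_once is a hypothetical helper added for
illustration only. Whether GCC then folds the indirect call the way it does for
the plain function has not been checked here.

```cpp
#include <cassert>

// Assumes -std=c++17 or later: closure types are literal types there and the
// closure-to-function-pointer conversion is constexpr.
template <typename T>
constexpr auto increment = [](void* self) { ++*static_cast<T*>(self); };

struct VTable { void (*increment)(void*); };

// constexpr requires a constant initializer, so this table cannot fall back
// to a dynamic initializer such as _GLOBAL__sub_I_main.
template <typename T>
constexpr VTable vtable{increment<T>};

// Hypothetical helper: advances a copy of `start` by one element through the
// type-erased function pointer and returns the result.
int* advance_once(int* start) {
    assert(start != nullptr);
    int* it = start;
    vtable<int*>.increment(&it);
    return it;
}
```

Calling advance_once(&input[0]) on the array from the report would return
&input[1], going through the same vtable<int*>.increment pointer as main does.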


Note that Clang (trunk) does a much better job of optimizing this. See [1] for
this example on Compiler Explorer, and [2] for a larger example that exhibits
the same behavior, where it results in a ~10x speed difference because the call
is made in a loop.
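
To illustrate why a loop magnifies the cost, here is a minimal sketch of the
call pattern. It is not the code behind [2]; increment_fn, LoopVTable and
advance_erased are hypothetical names, and it uses the plain-function variant.
Each iteration performs one call through the function pointer unless the
compiler can prove what the pointer refers to.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; not taken from the report or from [2].
template <typename T>
void increment_fn(void* self) { ++*static_cast<T*>(self); }

struct LoopVTable { void (*increment)(void*); };

// Advances a type-erased int* `steps` times, going through the function
// pointer once per iteration.
int* advance_erased(int* start, std::size_t steps) {
    assert(start != nullptr);
    LoopVTable const table{&increment_fn<int*>};
    int* it = start;
    for (std::size_t i = 0; i < steps; ++i)
        table.increment(&it);
    return it;
}
```

If the pointer is resolved at compile time, the loop can in principle collapse
to a single pointer addition; if not, every step remains an indirect call.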

[1]: https://godbolt.org/g/HQb5Y5
[2]: http://melpon.org/wandbox/permlink/Gs3njR3STPLk2Ecr
