https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93721

            Bug ID: 93721
           Summary: swapping adjacent scalars could be more efficient
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: msebor at gcc dot gnu.org
  Target Milestone: ---

For an implementation of a swap function like this:

  template <class T>
  void swap (std::pair<T, T> &x)
  {
    T t = x.first;
    x.first = x.second;
    x.second = t;
  }

GCC for x86_64 emits a single ROL instruction for T=char:

  _Z4swapIcEvRSt4pairIT_S1_E:
  .LFB97:
        .cfi_startproc
        rolw    $8, (%rdi)
        ret
        .cfi_endproc

but a series of MOV instructions for T=short and T=int:

  _Z4swapIiEvRSt4pairIT_S1_E:
  .LFB97:
        .cfi_startproc
        movl    (%rdi), %eax
        movl    4(%rdi), %edx
        movl    %eax, 4(%rdi)
        movl    %edx, (%rdi)
        ret
        .cfi_endproc

A hand-coded (but convoluted) implementation like the one below lets GCC for
x86_64 emit the ROL instruction for both int and short (shown here for int):

  void swap (std::pair<int, int> &x)
  {
    int y[2], t;
    static_assert (sizeof x == sizeof y);
    __builtin_memcpy (y, &x, sizeof x);
    t = y[0]; y[0] = y[1]; y[1] = t;
    __builtin_memcpy (&x, y, sizeof x);
  }

  _ZL4swapRSt4pairIiiE:
  .LFB94:
        .cfi_startproc
        rolq    $32, (%rdi)
        ret
        .cfi_endproc
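
The single rotate is equivalent to the swap because exchanging the two halves
of a 2N-bit value is the same as rotating it by N bits, whatever the byte
order. A minimal sketch of that equivalence for T=int follows. The names
rotl64_by_32 and swap_by_rotate are made up for illustration, and the code
assumes a 32-bit int and a std::pair<int, int> that stores first and second
contiguously with no padding (the static_asserts only check the sizes).

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Rotate a 64-bit value left by 32 bits, exchanging its two 32-bit halves.
inline std::uint64_t rotl64_by_32 (std::uint64_t v)
{
  return (v << 32) | (v >> 32);
}

// Hypothetical helper (not from the report): swap the members of the pair
// by rotating the pair's 64-bit object representation.
inline void swap_by_rotate (std::pair<int, int> &x)
{
  static_assert (sizeof (int) == 4, "sketch assumes a 32-bit int");
  static_assert (sizeof (std::pair<int, int>) == sizeof (std::uint64_t),
                 "sketch assumes no padding in the pair");
  std::uint64_t v;
  std::memcpy (&v, &x, sizeof v);
  v = rotl64_by_32 (v);
  // Cast to void* so the raw copy into a class type does not trigger
  // -Wclass-memaccess.
  std::memcpy (static_cast<void *> (&x), &v, sizeof v);
}
```

The same reasoning gives a 16-bit rotate by 8 for T=char and a 32-bit rotate
by 16 for T=short.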

Benchmarking it shows that the ROL form is measurably faster (at least on my
machine) than the MOV form.
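
A measurement along these lines could look like the sketch below. It is not
the harness used by the reporter: the names swap_plain, swap_memcpy and
time_ns, the use of std::chrono, and the iteration count are all assumptions,
and the outcome depends on the compiler version, flags, and CPU. The empty asm
statement is a GNU extension used here to keep the loop from being optimized
away.

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <utility>

// Plain member-wise swap, as in the first example of the report.
template <class T>
void swap_plain (std::pair<T, T> &x)
{
  T t = x.first;
  x.first = x.second;
  x.second = t;
}

// Hypothetical template form of the hand-coded memcpy workaround.
template <class T>
void swap_memcpy (std::pair<T, T> &x)
{
  T y[2], t;
  static_assert (sizeof (std::pair<T, T>) == sizeof y,
                 "sketch assumes no padding in the pair");
  std::memcpy (y, &x, sizeof y);
  t = y[0]; y[0] = y[1]; y[1] = t;
  // Cast to void* so the raw copy into a class type does not trigger
  // -Wclass-memaccess.
  std::memcpy (static_cast<void *> (&x), y, sizeof y);
}

// Call f on x n times and return the elapsed time in nanoseconds.
template <class F>
long long time_ns (F f, std::pair<int, int> &x, long n)
{
  auto start = std::chrono::steady_clock::now ();
  for (long i = 0; i < n; ++i)
    {
      f (x);
      // GNU extension: treat memory as clobbered so each swap is performed.
      asm volatile ("" : : "r" (&x) : "memory");
    }
  auto stop = std::chrono::steady_clock::now ();
  return std::chrono::duration_cast<std::chrono::nanoseconds>
    (stop - start).count ();
}

// Print one timing per swap form; n is an arbitrary iteration count.
inline void report (long n)
{
  std::pair<int, int> p (1, 2);
  std::printf ("plain:  %lld ns\n", time_ns (swap_plain<int>, p, n));
  std::printf ("memcpy: %lld ns\n", time_ns (swap_memcpy<int>, p, n));
}
```

Calling report (100000000) from main in an optimized build (for example -O2)
prints one timing per form. It is worth checking the generated assembly to
confirm which instruction sequence each loop actually contains.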
