https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118550

            Bug ID: 118550
           Summary: Missed optimization for fusing two byte loads with
                    offsets
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: arseny.kapoulkine at gmail dot com
  Target Milestone: ---

When presented with the following code (TYPE stands for the offset type under
test):

#include <stdint.h>

uint16_t readle(const unsigned char* data, TYPE offset)
{
    uint8_t b0 = data[offset], b1 = data[offset + 1];
    return b0 | (b1 << 8);
}

gcc always generates inefficient code when targeting x64: it loads the two
bytes separately regardless of the type of offset (int, size_t, ptrdiff_t).

For example, with int offset, gcc trunk generates:

        movsx   rsi, esi
        movzx   eax, BYTE PTR [rdi+1+rsi]
        movzx   edx, BYTE PTR [rdi+rsi]
        sal     eax, 8
        or      eax, edx

clang generates efficient code with just a single 2-byte load for all offset
types except unsigned int, where it has to handle wraparound. This includes
size_t (where unsigned overflow is well defined, but presumably offset can
never be SIZE_MAX, since that would overflow the pointer?). For int offset,
clang generates:

        movsxd  rax, esi
        movzx   eax, word ptr [rdi + rax]
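
For comparison, a memcpy-based variant (a sketch; the readle_memcpy name is
not from this report) already gets a single 16-bit load out of gcc, since gcc
folds the 2-byte memcpy into one load, assuming a little-endian target:

#include <stdint.h>
#include <string.h>

/* Workaround sketch, assuming a little-endian target: gcc folds the
   2-byte memcpy into a single 16-bit load. */
uint16_t readle_memcpy(const unsigned char* data, int offset)
{
    uint16_t v;
    memcpy(&v, data + offset, sizeof v);
    return v;
}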

See https://gcc.godbolt.org/z/6fcnedqPM for a full comparison of the different
offset types.
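
As for why unsigned int is the exception: offset + 1 is computed in 32-bit
unsigned arithmetic and can wrap around, so the two bytes are not guaranteed
to be adjacent. A hypothetical illustration (readle_wrap is not from this
report):

#include <limits.h>
#include <stdint.h>

/* Hypothetical illustration of the unsigned int wraparound hazard: with
   offset == UINT_MAX, offset + 1 wraps to 0, so b0 and b1 come from
   data + 4294967295 and data + 0, which are not adjacent, and fusing
   the two byte loads into one 16-bit load would be incorrect. */
uint16_t readle_wrap(const unsigned char* data)
{
    unsigned int offset = UINT_MAX;
    uint8_t b0 = data[offset];     /* reads data + 4294967295 */
    uint8_t b1 = data[offset + 1]; /* wraps: reads data + 0 */
    return b0 | (b1 << 8);
}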
