https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118550
Bug ID: 118550
Summary: Missed optimization for fusing two byte loads with offsets
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: arseny.kapoulkine at gmail dot com
Target Milestone: ---

When presented with the following code:

#include <stdint.h>

/* TYPE stands for the offset type under test: int, unsigned int,
   size_t, or ptrdiff_t. */
uint16_t readle(const unsigned char* data, TYPE offset)
{
    uint8_t b0 = data[offset], b1 = data[offset + 1];
    return b0 | (b1 << 8);
}

gcc always generates inefficient code when targeting x64: it loads the
two bytes separately regardless of the type of offset (int, size_t,
ptrdiff_t). For example, with an int offset, gcc trunk generates:

    movsx rsi, esi
    movzx eax, BYTE PTR [rdi+1+rsi]
    movzx edx, BYTE PTR [rdi+rsi]
    sal eax, 8
    or eax, edx

clang generates efficient code with a single 2-byte load for every
offset type except "unsigned int", where it has to handle wraparound:
with a 32-bit unsigned offset, offset + 1 can wrap to 0, so the two
bytes are not necessarily adjacent in memory. This includes size_t
(where overflow is also well defined, but presumably offset can never
be SIZE_MAX, since that would make the pointer data + offset + 1
overflow?). For an int offset, clang generates:

    movsxd rax, esi
    movzx eax, word ptr [rdi + rax]

See https://gcc.godbolt.org/z/6fcnedqPM for a full comparison of the
different offset types.
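
To make the wraparound point concrete, here is a small illustrative
program (not part of the original report; it assumes unsigned int is
32 bits, as on x86-64):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    unsigned int offset = UINT_MAX;

    /* offset + 1 wraps to 0, so data[offset] and data[offset + 1]
       would not be adjacent; a fused 16-bit load at data + offset
       would read the wrong bytes. */
    printf("offset     = %u\n", offset);      /* 4294967295 */
    printf("offset + 1 = %u\n", offset + 1u); /* 0 */
    return 0;
}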
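
For comparison, a common workaround today (a sketch, not taken from the
report; the name readle_memcpy is hypothetical) is to express the read
through a fixed-size memcpy, which compilers typically lower to a
single unaligned 16-bit load on x86-64:

#include <stdint.h>
#include <string.h>

uint16_t readle_memcpy(const unsigned char* data, size_t offset)
{
    uint16_t v;
    /* A 2-byte fixed-size memcpy is typically recognized and lowered
       to one (possibly unaligned) 16-bit load on x86-64. */
    memcpy(&v, data + offset, sizeof v);
    return v; /* equals b0 | (b1 << 8) only on little-endian targets */
}

With gcc -O2 this variant typically compiles to a single word-sized
movzx, which is why the missed optimization above shows up mainly in
the byte-combining form.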