https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
--- Comment #5 from Lasse Collin <lasse.collin at tukaani dot org> ---
If I understood correctly, PR 50417 is about wishing that GCC would infer
that a pointer given to memcpy has alignment higher than one. In my examples
the alignment of the uint8_t *b argument is one, so byte-by-byte access is
needed (if the target processor doesn't have fast unaligned access, as
determined from -mtune and -mno-strict-align). My report is about the
instruction sequence used for the byte-by-byte access.

Omitting the stack pointer manipulation and the return instruction, this is
bytes16:

        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0

And copy16:

        lbu     a4,0(a0)
        lbu     a5,1(a0)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)

Is the latter as good code as the former? If yes, then this report might be
invalid and I apologize for the noise.

PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to
memcpy, so that is the same detail as the -Os case in my first message.
Calling memcpy instead of expanding it inline saves six bytes in RV64C. On
ARM64 with -Os -mstrict-align the call doesn't save space:

bytes32:
        ldrb    w1, [x0]
        ldrb    w2, [x0, 1]
        orr     x2, x1, x2, lsl 8
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        orr     x1, x2, x1, lsl 16
        orr     w0, w1, w0, lsl 24
        ret

copy32:
        stp     x29, x30, [sp, -32]!
        mov     x1, x0
        mov     x2, 4
        mov     x29, sp
        add     x0, sp, 28
        bl      memcpy
        ldr     w0, [sp, 28]
        ldp     x29, x30, [sp], 32
        ret

And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:

bytes32:
        ldrb    w4, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w3, [x0, 3]
        orr     x2, x4, x2, lsl 8
        orr     x0, x2, x1, lsl 16
        orr     w0, w0, w3, lsl 24
        ret

copy32:
        sub     sp, sp, #16
        ldrb    w3, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        strb    w3, [sp, 12]
        strb    w2, [sp, 13]
        strb    w1, [sp, 14]
        strb    w0, [sp, 15]
        ldr     w0, [sp, 12]
        add     sp, sp, 16
        ret

ARM64 with -mstrict-align might be a contrived example in practice, though.
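The C source under discussion isn't quoted in this comment, so the following is a hypothetical reconstruction of the two idioms the assembly listings above plausibly correspond to: an explicit shift-and-OR load versus a memcpy into a local that GCC expands inline. Names (bytes16, copy16, bytes32) mirror the labels in the assembly; the exact original signatures are an assumption. Note that the shift-and-OR versions are explicitly little-endian, while the memcpy version takes the host byte order (the RISC-V and ARM64 targets shown here are both little-endian, so the two agree there).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction: assemble a 16-bit little-endian value
   from a pointer whose alignment is only 1. */
uint16_t bytes16(const uint8_t *b)
{
    return (uint16_t)(b[0] | ((uint16_t)b[1] << 8));
}

/* Same load expressed via memcpy; GCC expands this inline, and on a
   strict-alignment target it must still use byte accesses. */
uint16_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

/* 32-bit little-endian variant, matching the bytes32 listing. */
uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

Both forms are valid C for unaligned reads; the question in this report is only about the quality of the instruction sequences GCC emits for them.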