https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
--- Comment #5 from Lasse Collin <lasse.collin at tukaani dot org> ---
If I understood correctly, PR 50417 is about wishing that GCC would infer
that a pointer given to memcpy has alignment higher than one. In my examples
the alignment of the uint8_t *b argument is one, so byte-by-byte access is
needed (if the target processor doesn't have fast unaligned access, as
determined from -mtune and -mno-strict-align). My report is about the
instruction sequence used for the byte-by-byte access.

Omitting the stack pointer manipulation and the return instruction, this is
bytes16:

        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0

And copy16:

        lbu     a4,0(a0)
        lbu     a5,1(a0)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)

Is the latter as good code as the former? If yes, then this report might be
invalid and I apologize for the noise.

PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to
memcpy, so that is the same detail as the -Os case in my first message.
Calling memcpy instead of expanding it inline saves six bytes in RV64C. On
ARM64 with -Os -mstrict-align the call doesn't save space:

bytes32:
        ldrb    w1, [x0]
        ldrb    w2, [x0, 1]
        orr     x2, x1, x2, lsl 8
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        orr     x1, x2, x1, lsl 16
        orr     w0, w1, w0, lsl 24
        ret

copy32:
        stp     x29, x30, [sp, -32]!
        mov     x1, x0
        mov     x2, 4
        mov     x29, sp
        add     x0, sp, 28
        bl      memcpy
        ldr     w0, [sp, 28]
        ldp     x29, x30, [sp], 32
        ret

And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:

bytes32:
        ldrb    w4, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w3, [x0, 3]
        orr     x2, x4, x2, lsl 8
        orr     x0, x2, x1, lsl 16
        orr     w0, w0, w3, lsl 24
        ret

copy32:
        sub     sp, sp, #16
        ldrb    w3, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        strb    w3, [sp, 12]
        strb    w2, [sp, 13]
        strb    w1, [sp, 14]
        strb    w0, [sp, 15]
        ldr     w0, [sp, 12]
        add     sp, sp, 16
        ret

ARM64 with -mstrict-align might be a contrived example in practice, though.
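The C source under discussion isn't quoted in this comment, so the following is a hypothetical reconstruction of the two idioms the assembly listings above plausibly correspond to: an explicit shift-and-OR load versus a memcpy into a local that GCC expands inline. Names (bytes16, copy16, bytes32) mirror the labels in the assembly; the exact original signatures are an assumption. Note that the shift-and-OR versions are explicitly little-endian, while the memcpy version takes the host byte order (the RISC-V and ARM64 targets shown here are both little-endian, so the two agree there).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction: assemble a 16-bit little-endian value
   from a pointer whose alignment is only 1. */
uint16_t bytes16(const uint8_t *b)
{
    return (uint16_t)(b[0] | ((uint16_t)b[1] << 8));
}

/* Same load expressed via memcpy; GCC expands this inline, and on a
   strict-alignment target it must still use byte accesses. */
uint16_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

/* 32-bit little-endian variant, matching the bytes32 listing. */
uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

Both forms are valid C for unaligned reads; the question in this report is only about the quality of the instruction sequences GCC emits for them.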