https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123763

            Bug ID: 123763
           Summary: Suboptimal code for some 64-bit loads on 32-bit ARM
                    Cortex M
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: david at westcontrol dot com
  Target Milestone: ---

godbolt link: <https://godbolt.org/z/6jKfT4v4j>

I have been experimenting a little with 64-bit types on small 32-bit ARM Cortex-M
microcontrollers.  uint64_t is 8-byte aligned in the ABI, which is
unnecessary on these chips and can lead to wasted space from padding in
structs.  (If you only have a few KB of RAM, 4 bytes of padding per struct is
significant.)

It is easy to create a type that works like uint64_t but has only 4-byte
alignment:

#include <stdint.h>

typedef __attribute__((aligned(4))) uint64_t uint64_a4;
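To make the padding cost concrete, here is a small sketch (the struct names are
invented for this report) comparing a struct that holds a plain uint64_t with
one that holds the 4-byte-aligned variant.  Under the AAPCS (and most 64-bit
desktop ABIs) the first should be 16 bytes and the second 12:

```c
#include <stdint.h>

typedef __attribute__((aligned(4))) uint64_t uint64_a4;

struct with_u64 {      /* uint64_t member forces 8-byte alignment:      */
    uint32_t flags;    /* 4 bytes, then 4 bytes of padding,             */
    uint64_t value;    /* then 8 bytes -> 16 bytes total                */
};

struct with_u64_a4 {   /* 4-byte-aligned member needs no padding:       */
    uint32_t flags;    /* 4 bytes,                                      */
    uint64_a4 value;   /* then 8 bytes -> 12 bytes total                */
};
```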

But accessing data of this type is sometimes less efficient than with a normal
uint64_t (or its underlying type, unsigned long long int).  On these devices
there is no hardware difference between 4-byte and 8-byte alignment, and thus
no reason for a difference in the generated code.


uint64_t foo8(const uint64_t * p) {
    return *p;
}
uint64_t u8;
uint64_t bar8() { return u8; }


gives optimal Cortex-M0+ code:

foo8:
        ldmia   r0, {r0, r1}
        bx      lr

bar8:
        ldr     r3, .L6
        ldmia   r3!, {r0, r1}
        bx      lr


But using a 4-byte aligned type does not:

uint64_t foo4(const uint64_a4 * p) {
    return *p;
}
uint64_a4 u4;
uint64_t bar4() { return u4; }

foo4:
        movs    r3, r0
        ldmia   r3!, {r0, r1}
        bx      lr

bar4:
        ldr     r3, .L9
        ldr     r0, [r3, #8]
        ldr     r1, [r3, #12]
        bx      lr

On the Cortex-M4 (and other "bigger" Cortex-M devices), the compiler can use
the "ldrd" double register load instruction for optimal code, even for the
4-byte aligned type.  If I make a 2-byte aligned type for testing purposes, the
compiler must use two separate "ldr" loads on the Cortex-M4 as "ldrd" requires
4-byte alignment.  (On the Cortex-M0+, even "ldr" requires 4-byte alignment, so
the code there must use 16-bit loads.)  But again, the 4-byte loads are done
using an unnecessary extra register: 

typedef __attribute__((aligned(2))) uint64_t uint64_a2;
uint64_t foo2(const uint64_a2 * p) {
    return *p;
}
uint64_a2 u2;
uint64_t bar2() { return u2; }

On the Cortex-M4, this gives:

foo8:
        ldrd    r0, [r0]
        bx      lr
foo4:
        ldrd    r0, r1, [r0]
        bx      lr
foo2:
        mov     r3, r0
        ldr     r0, [r0]  @ unaligned
        ldr     r1, [r3, #4]      @ unaligned
        bx      lr
bar8:
        ldr     r3, .L6
        ldrd    r0, [r3]
        bx      lr
bar4:
        ldr     r3, .L9
        ldrd    r0, r1, [r3, #8]
        bx      lr
bar2:
        ldr     r3, .L12
        ldr     r0, [r3, #16]     @ unaligned
        ldr     r1, [r3, #20]     @ unaligned
        bx      lr

All the code is correct (that is always the most important thing), but such
inefficiencies add up.  It is not hard to find other examples where an
additional pointer register is used unnecessarily for data bigger than 32
bits, such as with structs - the "aligned" attributes are not necessary to
show the problem, but they give clear and simple examples here.
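The struct case can be reproduced without any attributes; the following sketch
(type and function names invented here) is a minimal test case in the same
style as the examples above, which can be fed to the same godbolt setup:

```c
#include <stdint.h>

/* A plain two-word struct - no alignment attributes needed. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} pair32;

/* Returning the struct by value performs an 8-byte copy,
 * analogous to the 64-bit loads in the examples above. */
pair32 load_pair(const pair32 *p) {
    return *p;
}
```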
