https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124087

            Bug ID: 124087
           Summary: [Missed Optimization] Redundant base address
                    calculation with -Os (missed CSE)
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bigmagicreadsun at gmail dot com
  Target Milestone: ---

Created attachment 63660
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63660&action=edit
test.c

I am observing a missed optimization in GCC trunk (version 15.2) on RISC-V,
which is also reproducible on ARM and x86. This issue is particularly
noticeable when compiling with the -Os (optimize for size) flag.

In the function below, a local pointer aa is calculated as
&(s0->stp_ulGrant[s2]). When accessing members of this struct via aa, GCC fails
to reuse the base address. Instead, it re-calculates the absolute address (Base
+ Index * Stride) from scratch at every access point.
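To make the "Base + Index * Stride" arithmetic concrete, here is a minimal sketch that derives the constants appearing in the assembly listings (stride 104, offsets 389 and 390) from the struct layout of the test case. The helper names bwp_offset and fdra_offset are hypothetical and exist only for illustration; they are not part of the attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the test case in this report. */
typedef struct {
    uint8_t padding1[77];
    int8_t u8_bwpIndicator;
    int8_t u32_fdRa;
    uint8_t padding2[25];
} MyStruct;

typedef struct {
    uint8_t head[312];
    MyStruct stp_ulGrant[10];
} GlobalContext;

/* All members are byte-sized, so no padding is inserted:
   the stride is 77 + 1 + 1 + 25 = 104 bytes. */
static_assert(sizeof(MyStruct) == 104, "stride of stp_ulGrant[]");
static_assert(offsetof(GlobalContext, stp_ulGrant) == 312, "array start");

/* Hypothetical helper: byte offset of stp_ulGrant[i].u8_bwpIndicator
   from s0, i.e. 312 + i * 104 + 77 = 389 + i * 104. */
static size_t bwp_offset(int i)
{
    return offsetof(GlobalContext, stp_ulGrant)
         + (size_t)i * sizeof(MyStruct)
         + offsetof(MyStruct, u8_bwpIndicator);
}

/* Hypothetical helper: same for u32_fdRa, i.e. 390 + i * 104. */
static size_t fdra_offset(int i)
{
    return offsetof(GlobalContext, stp_ulGrant)
         + (size_t)i * sizeof(MyStruct)
         + offsetof(MyStruct, u32_fdRa);
}
```

Both accesses share the term s0 + s2 * 104 and differ only in the constant displacement (389 versus 390), which is why a single computed base register is enough for all of them.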

This behavior generates redundant instructions (li, mul, add) for each access.
In contrast, Clang trunk (21.1.0) computes the base address once and reuses it
with relative offsets. Given that aa is invariant, this is a valid case for
Common Subexpression Elimination (CSE). This redundancy is especially
problematic for -Os, as it unnecessarily increases code size.

Steps to Reproduce:

Compile the following code with gcc -Os -S (target: riscv64):

#include <stdint.h>
typedef struct {
    uint8_t padding1[77];  
    int8_t u8_bwpIndicator;
    int8_t u32_fdRa;
    uint8_t padding2[25];
} MyStruct;
typedef struct {
    uint8_t head[312]; 
    MyStruct stp_ulGrant[10];
} GlobalContext;
void test_function( GlobalContext * restrict s0, int s2) {
    MyStruct* restrict const aa = &(s0->stp_ulGrant[s2]);
    volatile uint8_t u32_decodeResult = 0xAB;
    if(s2 > 5)
    {
      aa->u8_bwpIndicator = (uint8_t)u32_decodeResult;
    }
    if(aa->u8_bwpIndicator > 0)
    {
      aa->u32_fdRa = (int8_t)u32_decodeResult;
    }
}
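For reference, the following sketch exercises both branches of the test case, to pin down the observable behavior that any optimized code sequence has to preserve. The CaseResult type and the run_case helper are hypothetical additions for illustration only; test_function itself is copied unchanged from above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t padding1[77];
    int8_t u8_bwpIndicator;
    int8_t u32_fdRa;
    uint8_t padding2[25];
} MyStruct;

typedef struct {
    uint8_t head[312];
    MyStruct stp_ulGrant[10];
} GlobalContext;

/* The function from the report, unchanged. */
void test_function(GlobalContext * restrict s0, int s2) {
    MyStruct* restrict const aa = &(s0->stp_ulGrant[s2]);
    volatile uint8_t u32_decodeResult = 0xAB;
    if(s2 > 5)
    {
      aa->u8_bwpIndicator = (uint8_t)u32_decodeResult;
    }
    if(aa->u8_bwpIndicator > 0)
    {
      aa->u32_fdRa = (int8_t)u32_decodeResult;
    }
}

/* Hypothetical record of the two fields after one call. */
typedef struct {
    int8_t bwp;
    int8_t fdRa;
} CaseResult;

/* Hypothetical helper: zero a context, seed u8_bwpIndicator of entry s2,
   run test_function, and return the two fields it may have written. */
static CaseResult run_case(int s2, int8_t initial_bwp)
{
    GlobalContext ctx;
    memset(&ctx, 0, sizeof ctx);
    ctx.stp_ulGrant[s2].u8_bwpIndicator = initial_bwp;
    test_function(&ctx, s2);
    CaseResult r = { ctx.stp_ulGrant[s2].u8_bwpIndicator,
                     ctx.stp_ulGrant[s2].u32_fdRa };
    return r;
}
```

Converting 0xAB to int8_t is implementation-defined; GCC and Clang both produce -85, which matches the li a5,-85 in the GCC listing. Under that assumption, run_case(7, 0) stores -85 into u8_bwpIndicator and leaves u32_fdRa untouched, while run_case(2, 1) leaves u8_bwpIndicator at 1 and stores -85 into u32_fdRa. In both paths the stores go to the same element, stp_ulGrant[s2].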

Actual Output (GCC 15.2 -Os):

GCC generates the address calculation sequence (li, mul, add) twice, even
though aa is invariant. This wastes code size.

test_function:
        li      a5,-85
        addi    sp,sp,-16
        sb      a5,15(sp)
        li      a5,5
        ble     a1,a5,.L2
        li      a5,104              ; Redundant calculation #1
        mul     a5,a1,a5
        add     a5,a0,a5
        sb      a4,389(a5)
.L2:
        li      a5,104              ; Redundant calculation #2
        mul     a1,a1,a5
        add     a0,a0,a1
        lb      a5,389(a0)
        ble     a5,zero,.L1
        lbu     a5,15(sp)
        sb      a5,390(a0)
.L1:
        addi    sp,sp,16
        jr      ra

Expected Output (Clang 21.1.0 -Os):

Clang computes the base address (aa) once and reuses the register a0 with
offsets, resulting in smaller and faster code.

test_function:
        addi    sp, sp, -16
        li      a2, 104
        li      a3, 171
        mul     a2, a1, a2
        add     a0, a0, a2       ; Base address 'aa' calculated once
        li      a2, 5
        sb      a3, 15(sp)
        bge     a2, a1, .LBB0_2
        lbu     a1, 15(sp)
        sb      a1, 389(a0)      ; Reuses 'aa' with offset
        j       .LBB0_3
.LBB0_2:
        lbu     a1, 389(a0)      ; Reuses 'aa' with offset
.LBB0_3:
        slli    a1, a1, 24
        srai    a1, a1, 24
        blez    a1, .LBB0_5
        lbu     a1, 15(sp)
        sb      a1, 390(a0)      ; Reuses 'aa' with offset
.LBB0_5:
        addi    sp, sp, 16
        ret

Analysis:

The variable aa is a restrict const pointer derived from s0 and s2. Since
neither s0 nor s2 changes within the function, the value of aa is invariant.

GCC's decision to re-emit the address calculation for every member access
(aa->...) is suboptimal. It works against the goal of -Os in particular: each
repetition adds three instructions (li, mul, add) that could be avoided by
keeping the calculated base address in a register, as Clang does.

This optimization should ideally be handled by the RTL (Register Transfer
Language) alias analysis or CSE passes to ensure the base address is hoisted
and reused.
