https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124087
Bug ID: 124087
Summary: [Missed Optimization] Redundant base address
calculation with -Os (missed CSE)
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: bigmagicreadsun at gmail dot com
Target Milestone: ---
Created attachment 63660
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63660&action=edit
test.c
I am observing a missed optimization in GCC trunk (15.2) on RISC-V, which is
also reproducible on ARM and x86. The issue is particularly noticeable when
compiling with -Os (optimize for size).
In the function below, a local pointer aa is computed as
&(s0->stp_ulGrant[s2]). When accessing members of the struct through aa, GCC
fails to reuse this base address: it recomputes the absolute address (Base +
Index * Stride) from scratch at every access point, emitting redundant
instructions (li, mul, add) each time. In contrast, Clang trunk (21.1.0)
computes the base address once and reuses it with relative offsets. Since aa
is invariant, this is a valid candidate for Common Subexpression Elimination
(CSE). The redundancy is especially problematic at -Os, as it needlessly
increases code size.
Steps to Reproduce:
Compile the following code with gcc -Os -S (target: riscv64):
#include <stdint.h>

typedef struct {
    uint8_t padding1[77];
    int8_t u8_bwpIndicator;
    int8_t u32_fdRa;
    uint8_t padding2[25];
} MyStruct;

typedef struct {
    uint8_t head[312];
    MyStruct stp_ulGrant[10];
} GlobalContext;

void test_function(GlobalContext *restrict s0, int s2) {
    MyStruct *restrict const aa = &(s0->stp_ulGrant[s2]);
    volatile uint8_t u32_decodeResult = 0xAB;

    if (s2 > 5) {
        aa->u8_bwpIndicator = (uint8_t)u32_decodeResult;
    }
    if (aa->u8_bwpIndicator > 0) {
        aa->u32_fdRa = (int8_t)u32_decodeResult;
    }
}
Actual Output (GCC 15.2 -Os):
GCC generates the address calculation sequence (li, mul, add) twice, even
though aa is invariant. This wastes code size.
test_function:
li a5,-85
addi sp,sp,-16
sb a5,15(sp)
li a5,5
ble a1,a5,.L2
li a5,104 ; Redundant calculation #1
mul a5,a1,a5
add a5,a0,a5
sb a4,389(a5)
.L2:
li a5,104 ; Redundant calculation #2
mul a1,a1,a5
add a0,a0,a1
lb a5,389(a0)
ble a5,zero,.L1
lbu a5,15(sp)
sb a5,390(a0)
.L1:
addi sp,sp,16
jr ra
Expected Output (Clang 21.1.0 -Os):
Clang computes the base address (aa) once and reuses the register a0 with
offsets, resulting in smaller and faster code.
test_function:
addi sp, sp, -16
li a2, 104
li a3, 171
mul a2, a1, a2
add a0, a0, a2 ; Base address 'aa' calculated once
li a2, 5
sb a3, 15(sp)
bge a2, a1, .LBB0_2
lbu a1, 15(sp)
sb a1, 389(a0) ; Reuses 'aa' with offset
j .LBB0_3
.LBB0_2:
lbu a1, 389(a0) ; Reuses 'aa' with offset
.LBB0_3:
slli a1, a1, 24
srai a1, a1, 24
blez a1, .LBB0_5
lbu a1, 15(sp)
sb a1, 390(a0) ; Reuses 'aa' with offset
.LBB0_5:
addi sp, sp, 16
ret
Analysis:
The variable aa is a restrict const pointer derived from s0 and s2. Since
neither s0 nor s2 changes within the function, the value of aa is invariant.
GCC's decision to re-emit the address-calculation logic for every member
access (aa->...) is therefore suboptimal, and it is directly at odds with the
goal of -Os: repeating the three-instruction sequence (li, mul, add) at each
access site bloats the binary compared to simply keeping the computed base
address in a register, as Clang does.
This optimization should ideally be handled by the RTL (Register Transfer
Language) CSE or alias-analysis passes, so that the base address is hoisted
and reused.
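As a possible source-level workaround until this is addressed, an empty
extended-asm barrier on aa (a GNU extension) can force the computed address to
be materialized in a register; whether it actually changes codegen for this
exact testcase is untested and an assumption:

```c
#include <stdint.h>

typedef struct {
    uint8_t padding1[77];
    int8_t u8_bwpIndicator;
    int8_t u32_fdRa;
    uint8_t padding2[25];
} MyStruct;

typedef struct {
    uint8_t head[312];
    MyStruct stp_ulGrant[10];
} GlobalContext;

void test_function_hoisted(GlobalContext *restrict s0, int s2) {
    MyStruct *restrict aa = &(s0->stp_ulGrant[s2]);
    /* Empty asm with a "+r" constraint: an optimization barrier that makes
       the compiler treat aa as an opaque register value, so later accesses
       become base + constant offset. Semantics are unchanged; the effect on
       the generated code is a hypothesis, not a verified fix. */
    __asm__ volatile("" : "+r"(aa));
    volatile uint8_t u32_decodeResult = 0xAB;

    if (s2 > 5) {
        aa->u8_bwpIndicator = (uint8_t)u32_decodeResult;
    }
    if (aa->u8_bwpIndicator > 0) {
        aa->u32_fdRa = (int8_t)u32_decodeResult;
    }
}
```

The barrier only constrains the optimizer; the function's observable behavior
is identical to the original reproducer.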