https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125456
Bug ID: 125456
Summary: RTL Loop Unroll AGU Serialization and Redundant IV Increments
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lin1.hu at intel dot com
Target Milestone: ---
After pass_rtl_unroll_loops, GCC 15 and trunk emit IV-split codegen like:
.L2:
    cmp (%rbx), %ecx
    je .L_found
    lea 4(%rbx), %rdx
    cmp (%rdx), %ecx
    je .L_found
    lea 8(%rbx), %rdx # depends on previous lea
    cmp (%rdx), %ecx
    ...
    add $16, %rbx # the only increment really needed
    jmp .L2
This codegen has two issues:
1. The lea + cmp + lea chain serializes AGU issue.
2. The first N-1 leas are redundant.
If we (a) rewrite each copy's address as base + k*step so all copies share a
loop-invariant base, and (b) sink the per-copy IV fix-up onto the exit edge,
the codegen becomes:
.L2:
    cmp (%rbx), %ecx
    je .L_found_0
    cmp 4(%rbx), %ecx # parallel
    je .L_found_1
    cmp 8(%rbx), %ecx # parallel
    je .L_found_2
    cmp 12(%rbx), %ecx # parallel
    je .L_found_3
    add $16, %rbx
    jmp .L2
.L_found_0: lea (%rbx), %rax; ret
.L_found_1: lea 4(%rbx), %rax; ret
.L_found_2: lea 8(%rbx), %rax; ret
.L_found_3: lea 12(%rbx), %rax; ret
This breaks the dependency chain, drops N-1 instructions from the hot loop, and
keeps the IV live-out fix-up only on cold exit paths.
Testcase source:
int *
find_ptr (int *p, int *end, int val)
{
#pragma GCC unroll 8
  while (p != end)
    {
      if (*p == val)
        return p;
      p++;
    }
  return 0;
}
Our current idea is to change apply_opt_in_copies() in gcc/loop-unroll.cc:
1. Introduce a new pseudo register and use loop invariants to compute an
accurate base address; rewrite each copy's cmp so it reads memory directly
through this register (with a per-copy immediate offset).
2. Get the exit_edge, insert a new BB between the current BB and the final BB,
and place the IV update into that new BB.
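The two steps above can be modelled at the source level. The following is only a sketch under assumptions, not the planned RTL change: it uses an unroll factor of 4 (matching the assembly above rather than the pragma in the testcase), keeps the p != end exit test in every copy, and the names find_ptr_model, find_ptr_ref and check_model are hypothetical. The local variable base stands in for the new pseudo register, and each found_k label stands in for the new BB inserted on an exit edge.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical source-level model of the rewritten loop, unrolled by 4.
   The real change would operate on RTL inside the unroller.  */
int *
find_ptr_model (int *p, int *end, int val)
{
  int *base;

loop:
  base = p;                       /* Step 1: one shared base per iteration.  */
  if (base == end) goto not_found;
  if (base[0] == val) goto found_0;
  if (base + 1 == end) goto not_found;
  if (base[1] == val) goto found_1;   /* Address is base + constant offset.  */
  if (base + 2 == end) goto not_found;
  if (base[2] == val) goto found_2;
  if (base + 3 == end) goto not_found;
  if (base[3] == val) goto found_3;
  p = base + 4;                   /* The only IV update left in the hot loop.  */
  goto loop;

  /* Step 2: one block per exit edge; the IV fix-up happens only here.  */
found_0: p = base;     goto done;
found_1: p = base + 1; goto done;
found_2: p = base + 2; goto done;
found_3: p = base + 3; goto done;

done:
  return p;
not_found:
  return NULL;
}

/* The original loop from the testcase, used as a reference.  */
static int *
find_ptr_ref (int *p, int *end, int val)
{
  while (p != end)
    {
      if (*p == val)
        return p;
      p++;
    }
  return NULL;
}

/* Compare the model against the reference for every prefix length,
   including lengths that are not a multiple of the unroll factor.  */
void
check_model (void)
{
  int a[] = { 5, 1, 9, 3, 8, 7, 2, 6, 4 };
  size_t n = sizeof a / sizeof a[0];

  for (size_t len = 0; len <= n; len++)
    for (int val = 0; val <= 10; val++)
      assert (find_ptr_model (a, a + len, val)
              == find_ptr_ref (a, a + len, val));
}
```

In this model no copy reads or writes p between the top of the loop and its exit, which is the property that lets the per-copy address computations be independent of each other.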
Do you have any suggestions before I send a formal patch?