https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125456

            Bug ID: 125456
           Summary: RTL Loop Unroll AGU Serialization and Redundant IV
                    Increments
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lin1.hu at intel dot com
  Target Milestone: ---

After pass_rtl_unroll_loops, GCC 15 and trunk emit IV-split codegen like:

  .L2:
      cmp  (%rbx), %ecx
      je   .L_found
      lea  4(%rbx), %rdx
      cmp  (%rdx), %ecx
      je   .L_found
      lea  8(%rbx), %rdx      # depends on previous lea
      cmp  (%rdx), %ecx
      ...
      add  $16, %rbx          # the only increment really needed
      jmp  .L2

This codegen has two issues:
1. The lea + cmp + lea chain serializes AGU issue.
2. The first N-1 leas are redundant.

If we (a) rewrite each copy's address as base + k*step so all copies share a
loop-invariant base, and (b) sink the per-copy IV fix-up onto the exit edge,
the codegen becomes:
  .L2:
      cmp  (%rbx), %ecx
      je   .L_found_0
      cmp  4(%rbx), %ecx        # parallel
      je   .L_found_1
      cmp  8(%rbx), %ecx        # parallel
      je   .L_found_2
      cmp  12(%rbx), %ecx       # parallel
      je   .L_found_3
      add  $16, %rbx
      jmp  .L2
  .L_found_0: lea  (%rbx), %rax; ret
  .L_found_1: lea  4(%rbx), %rax; ret
  .L_found_2: lea  8(%rbx), %rax; ret
  .L_found_3: lea  12(%rbx), %rax; ret

This breaks the dependency chain, drops N-1 instructions from the hot loop, and
keeps the IV live-out fix-up only on cold exit paths.

Testcase source:
  int *
  find_ptr (int *p, int *end, int val)
  {
  #pragma GCC unroll 8
    while (p != end)
      {
        if (*p == val)
          return p;
        p++;
      }
    return 0;
  }

Our current idea is to extend apply_opt_in_copies() in gcc/loop-unroll.cc:

1. Introduce a new pseudo register and use loop invariants to compute an
accurate base address; rewrite each copy's cmp so it reads memory directly
through this register (with a per-copy immediate offset).
2. Get the exit_edge, insert a new BB between the current BB and the final BB,
and place the IV update into that new BB.
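
As a rough source-level model of the two steps (not the RTL change itself), the
intended shape for an unroll factor of 4 could look like the sketch below. The
names find_ptr_ref, find_ptr_sunk, base, and found_0..found_3 are hypothetical
and only for illustration. The model also assumes a guard (end - base >= 4)
with a scalar remainder loop in place of the per-copy p != end tests, which
the listings above elide.

```c
/* The original loop from the testcase, kept as a reference.  */
int *
find_ptr_ref (int *p, int *end, int val)
{
  while (p != end)
    {
      if (*p == val)
        return p;
      p++;
    }
  return 0;
}

/* Hypothetical hand-written model of the proposed shape (unroll factor 4).
   Assumption: a guard plus remainder loop stands in for the per-copy
   p != end exit tests.  */
int *
find_ptr_sunk (int *p, int *end, int val)
{
  int *base = p;	/* Step 1: one base shared by all copies.  */

  while (end - base >= 4)
    {
      /* Each copy addresses memory as base + k, so no copy depends on
         an address computed by an earlier copy.  */
      if (base[0] == val) goto found_0;
      if (base[1] == val) goto found_1;
      if (base[2] == val) goto found_2;
      if (base[3] == val) goto found_3;
      base += 4;	/* The only increment left in the hot loop.  */
    }

  /* Remainder iterations, handled one element at a time.  */
  for (p = base; p != end; p++)
    if (*p == val)
      return p;
  return 0;

  /* Step 2: the IV fix-up lives in a block on each exit edge.  */
found_0: p = base;     return p;
found_1: p = base + 1; return p;
found_2: p = base + 2; return p;
found_3: p = base + 3; return p;
}
```

Both functions should return the same pointer for any valid [p, end) range; the
difference is only where the live-out value of p gets materialized.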

Do you have any suggestions before I send a formal patch?
