https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124625

            Bug ID: 124625
           Summary: Missed ubfx + shifted addressing for (x >> N) & MASK
                    when MASK has trailing zeros
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ptomsich at gcc dot gnu.org
  Target Milestone: ---

Description

  When the result of (x >> N) & MASK is used as a byte offset into memory and
MASK has trailing zeros (i.e., MASK = C << B), GCC does not decompose the
expression into ((x >> (N+B)) & C) << B.  This prevents the AArch64 backend
from using ubfx (bitfield extract) and folding the shift into the ldr
addressing mode.

  Reproducer:

  long f(long occ, long base, long offset) {
    long idx = (occ >> 54) & 504;  /* 504 = 63 << 3 */
    return *(long *)(base + idx + offset);
  }

  Current output (GCC 16.0.1, -O2):

  f:
          asr     x0, x0, 54
          add     x1, x1, x2
          and     x0, x0, 504
          ldr     x0, [x1, x0]
          ret

  4 instructions.  The asr + and pair extracts bits 57-62 but leaves them
shifted left by 3, preventing the backend from recognizing a bitfield extract
or folding the scale into the load.

  Expected output:

  f:
          add     x1, x1, x2
          ubfx    x0, x0, 57, 6
          ldr     x0, [x1, x0, lsl 3]
          ret

  3 instructions.  Decomposing (x >> 54) & 504 into ((x >> 57) & 63) << 3
exposes two opportunities:

  1. (x >> 57) & 63 is a 6-bit field extraction at position 57 -- maps to ubfx
x0, x0, 57, 6 (1 instruction instead of asr + and = 2).
  2. The << 3 folds into the ldr addressing mode as ldr x0, [x1, x0, lsl 3] (no
extra instruction).

  Verification that GCC can generate the optimal code:

  Writing the decomposed form by hand confirms the backend is ready:

  long f_manual(long occ, long base, long offset) {
    int idx = (occ >> 57) & 63;
    return *(long *)(base + offset + (long)idx * 8);
  }

  produces:

  f_manual:
          add     x1, x1, x2
          ubfx    x0, x0, 57, 6
          ldr     x0, [x1, w0, sxtw 3]
          ret

  3 instructions -- the backend already handles the decomposed form correctly.

  Root cause:

  The combine pass (combine.cc, make_compound_operation_int()) handles (and
(lshiftrt X A) MASK) but only when MASK is 2^N - 1 (consecutive 1's from bit
0):

  if (GET_CODE (XEXP (x, 0)) == LSHIFTRT
      && (i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0)

  For MASK = 504 = 0x1f8: exact_log2 (505) returns -1, so no decomposition is
attempted.  The missing case is masks with trailing zeros, where factoring out
the low zero bits would expose a bitfield extract plus a shift that the
backend can fold into addressing modes or shift-add instructions.

  Affected targets:

  This is a target-independent issue in combine.  Other targets also benefit
from the decomposition:

  - RISC-V (Zba): exposes sh3add (shift-add) opportunities
  - x86-64: enables scaled index addressing ([base + idx*8]), though the
instruction count is the same

  AArch64 benefits the most because ubfx replaces two instructions and the
shifted ldr addressing mode absorbs the scale.

  Real-world occurrence:

  This pattern appears in chess engines (deepsjeng BishopAttacks) where
bitboard lookup tables are indexed by masked shift results.  The hot function
contains 2 instances.
