Hi both,

I'm looking at this and will aim to get back to you soon.  Sorry for not getting
to this sooner.

Alex

On 28/01/2026 05:26, Soumya AR wrote:
> Ping.
> 
> Thanks,
> Soumya
> 
> > On 20 Jan 2026, at 3:59 PM, Kyrylo Tkachov <[email protected]> wrote:
> > 
> > 
> > 
> >> On 20 Jan 2026, at 10:06, Kyrylo Tkachov <[email protected]> wrote:
> >> 
> >> 
> >> 
> >>> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]> 
> >>> wrote:
> >>> 
> >>> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
> >>>> 
> >>>> Ping.
> >>>> 
> >>>> I split the files from the previous mail so it's hopefully easier to 
> >>>> review.
> >>> 
> >>> I can review this, but approval won't come until stage 1. This pass is
> >>> too risky at this point in the release cycle.
> >> 
> >> Thanks for any feedback you can give. FWIW we’ve been testing this 
> >> internally for a few months without any issues.
> > 
> > One option to reduce the risk, which Soumya’s initial patch implemented, is 
> > to enable this only for -mcpu=olympus, the target we initially developed and 
> > tested it on.
> > That way it wouldn’t affect most aarch64 targets, and users would still have 
> > the -mno-* option to disable it as a workaround if it causes trouble.
> > Would that be okay with you?
> > Thanks,
> > Kyrill
> > 
> >> 
> >>> 
> >>> Though I also wonder how much of this can/should be done on the gimple
> >>> level in a generic way.
> >> 
> >> GIMPLE does have powerful ranger infrastructure for this, but I was 
> >> concerned about doing this earlier because it’s very likely that some 
> >> later pass would introduce extra extend operations, undoing the benefit 
> >> of the narrowing.
> >> 
> >> Thanks,
> >> Kyrill
> >> 
> >>> And if there is a way to get the zero-bits from the gimple level down
> >>> to the RTL level still so we don't need to keep on recomputing them
> >>> (this is useful for other passes too).
> >>> 
> >>> Thanks,
> >>> Andrew Pinski
> >>> 
> >>>> 
> >>>> Also CC'ing Alex Coplan to this thread.
> >>>> 
> >>>> Thanks,
> >>>> Soumya
> >>>> 
> >>>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
> >>>>> 
> >>>>> Hi Tamar,
> >>>>> 
> >>>>> Attaching an updated version of this patch that enables the pass at O2 
> >>>>> and above
> >>>>> on aarch64, and can be optionally disabled with -mno-narrow-gp-writes.
> >>>>> 
> >>>>> Enabling it by default at O2 touched quite a large number of tests, 
> >>>>> which I
> >>>>> have updated in this patch.
> >>>>> 
> >>>>> Most of the updates are straightforward, involving changing x registers 
> >>>>> to (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
> >>>>> 
> >>>>> There are some tests (e.g. aarch64/int_mov_immediate_1.c) where the
> >>>>> representation of the immediate changes:
> >>>>> 
> >>>>>      mov w0, 4294927974 -> mov w0, -39322
> >>>>> 
> >>>>> This is because when the following RTL is narrowed to SImode:
> >>>>>      (set (reg/i:DI 0 x0)
> >>>>>              (const_int 4294927974 [0xffff6666]))
> >>>>> 
> >>>>> the MSB becomes bit 31, which is set, so the immediate is printed as
> >>>>> signed.
> >>>>> 
> >>>>> Thanks,
> >>>>> Soumya
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
> >>>>>> 
> >>>>>> 
> >>>>>> Ping.
> >>>>>> 
> >>>>>> Thanks,
> >>>>>> Soumya
> >>>>>> 
> >>>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
> >>>>>>> 
> >>>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
> >>>>>>> 
> >>>>>>> This patch adds a new AArch64 RTL pass that optimizes 64-bit
> >>>>>>> general purpose register operations to use 32-bit W-registers when the
> >>>>>>> upper 32 bits of the register are known to be zero.
> >>>>>>> 
> >>>>>>> This helps the Olympus core, which benefits from using 32-bit 
> >>>>>>> W-registers over 64-bit X-registers where possible, as recommended 
> >>>>>>> by the updated Olympus Software Optimization Guide, which will be 
> >>>>>>> published soon.
> >>>>>>> 
> >>>>>>> The pass can be controlled with -mnarrow-gp-writes. It runs at -O2 
> >>>>>>> and above, but is enabled by default only for -mcpu=olympus.
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W 
> >>>>>>> register
> >>>>>>> that maps to its lower half.  When we can guarantee that the upper 32 
> >>>>>>> bits
> >>>>>>> are never used, we can safely narrow operations to use W registers 
> >>>>>>> instead.
> >>>>>>> 
> >>>>>>> For example, this code:
> >>>>>>> uint64_t foo(uint64_t a) {
> >>>>>>>  return (a & 255) + 3;
> >>>>>>> }
> >>>>>>> 
> >>>>>>> Currently compiles to:
> >>>>>>> and x8, x0, #0xff
> >>>>>>> add x0, x8, #3
> >>>>>>> 
> >>>>>>> But with this pass enabled, it optimizes to:
> >>>>>>> and x8, x0, #0xff
> >>>>>>> add w0, w8, #3      // Using W register instead of X
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> The pass operates in two phases:
> >>>>>>> 
> >>>>>>> 1) Analysis Phase:
> >>>>>>> - Using RTL-SSA, iterates through extended basic blocks (EBBs)
> >>>>>>> - Computes nonzero bit masks for each register definition
> >>>>>>> - Recursively processes PHI nodes
> >>>>>>> - Identifies candidates for narrowing
> >>>>>>> 2) Transformation Phase:
> >>>>>>> - Applies narrowing to validated candidates
> >>>>>>> - Converts DImode operations to SImode where safe
> >>>>>>> 
> >>>>>>> The pass runs late in the RTL pipeline, after register allocation, to 
> >>>>>>> ensure
> >>>>>>> stable def-use chains and avoid interfering with earlier 
> >>>>>>> optimizations.
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> nonzero_bits(src, DImode) is a function defined in rtlanal.cc that 
> >>>>>>> recursively
> >>>>>>> analyzes RTL expressions to compute a bitmask. However, nonzero_bits 
> >>>>>>> has a
> >>>>>>> limitation: when it encounters a register, it conservatively returns 
> >>>>>>> the mode
> >>>>>>> mask (all bits potentially set). Since this pass analyzes all defs in 
> >>>>>>> an
> >>>>>>> instruction, this information can be used to refine the mask. The 
> >>>>>>> pass maintains
> >>>>>>> a hash map of computed bit masks and installs a custom RTL hooks 
> >>>>>>> callback
> >>>>>>> to consult this mask when encountering a register.
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> PHI nodes require special handling to merge masks from all inputs. 
> >>>>>>> This is done by combine_mask_from_phi, which handles three cases:
> >>>>>>> 1. Input edge has a definition: This is the simplest case. The def 
> >>>>>>> information is retrieved and its mask is looked up.
> >>>>>>> 2. Input edge has no definition: A conservative mask is assumed for 
> >>>>>>> that input.
> >>>>>>> 3. Input edge is a PHI: combine_mask_from_phi is called recursively 
> >>>>>>> to merge the masks of all incoming values.
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> When processing regular instructions, the pass first tackles SET and 
> >>>>>>> PARALLEL
> >>>>>>> patterns with compare instructions.
> >>>>>>> 
> >>>>>>> Single SET instructions:
> >>>>>>> 
> >>>>>>> If the upper 32 bits of the source are known to be zero, then the 
> >>>>>>> instruction
> >>>>>>> qualifies for narrowing. Instead of just using lowpart_subreg for the 
> >>>>>>> source,
> >>>>>>> we define narrow_dimode_src to attempt further optimizations:
> >>>>>>> 
> >>>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via 
> >>>>>>> simplify_gen_binary
> >>>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
> >>>>>>> 
> >>>>>>> PARALLEL Instructions (Compare + SET):
> >>>>>>> 
> >>>>>>> The pass tackles flag-setting operations (ADDS, SUBS, ANDS, etc.) 
> >>>>>>> where the SET
> >>>>>>> source equals the first operand of the COMPARE. Depending on the 
> >>>>>>> CC mode of the compare, the pass checks that the required bits are 
> >>>>>>> zero:
> >>>>>>> 
> >>>>>>> - CC_Zmode/CC_NZmode: Upper 32 bits
> >>>>>>> - CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
> >>>>>>> 
> >>>>>>> If the instruction does not match the above patterns (or matches but 
> >>>>>>> cannot be optimized), the pass still analyzes all its definitions, so 
> >>>>>>> that every definition has an entry in nzero_map.
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> When transforming the qualified instructions, the pass uses 
> >>>>>>> rtl_ssa::recog and
> >>>>>>> rtl_ssa::change_is_worthwhile to verify the new pattern and determine 
> >>>>>>> if the
> >>>>>>> transformation is worthwhile.
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> As an additional benefit, testing on Neoverse-V2 shows that instances 
> >>>>>>> of
> >>>>>>> 'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
> >>>>>>> instructions after this pass narrows them.
> >>>>>>> 
> >>>>>>> ---
> >>>>>>> 
> >>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
> >>>>>>> regression.
> >>>>>>> OK for mainline?
> >>>>>>> 
> >>>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
> >>>>>>> Signed-off-by: Soumya AR <[email protected]>
> >>>>>>> 
> >>>>>>> gcc/ChangeLog:
> >>>>>>> 
> >>>>>>> * config.gcc: Add aarch64-narrow-gp-writes.o.
> >>>>>>> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
> >>>>>>> pass_narrow_gp_writes before pass_cleanup_barriers.
> >>>>>>> * config/aarch64/aarch64-tuning-flags.def 
> >>>>>>> (AARCH64_EXTRA_TUNING_OPTION):
> >>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
> >>>>>>> * config/aarch64/tuning_models/olympus.h:
> >>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
> >>>>>>> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): 
> >>>>>>> Declare.
> >>>>>>> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
> >>>>>>> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
> >>>>>>> * doc/invoke.texi: Document -mnarrow-gp-writes.
> >>>>>>> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
> >>>>>>> 
> >>>>>>> gcc/testsuite/ChangeLog:
> >>>>>>> 
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
> >>>>>>> 
> >>>>>>> 
> >>>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
> 
> 
