Hi both,

I'm looking at this and will aim to get back to you soon.  Sorry for not
getting to this sooner.
Alex

On 28/01/2026 05:26, Soumya AR wrote:
> Ping.
>
> Thanks,
> Soumya
>
> > On 20 Jan 2026, at 3:59 PM, Kyrylo Tkachov <[email protected]> wrote:
> >
> >
> >
> >> On 20 Jan 2026, at 10:06, Kyrylo Tkachov <[email protected]> wrote:
> >>
> >>
> >>
> >>> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]> wrote:
> >>>
> >>> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
> >>>>
> >>>> Ping.
> >>>>
> >>>> I split the files from the previous mail so it's hopefully easier to
> >>>> review.
> >>>
> >>> I can review this, but the approval won't be until stage1.  This pass
> >>> is too risky at this point of the release cycle.
> >>
> >> Thanks for any feedback you can give.  FWIW we've been testing this
> >> internally for a few months without any issues.
> >
> > One option to reduce the risk, which Soumya's initial patch implemented,
> > was to enable this only for -mcpu=olympus.  We initially developed and
> > tested it on that target.
> > That way it wouldn't affect most aarch64 targets, and we'd still have
> > the -mno-* option to disable it as a workaround for users if it causes
> > trouble.
> > Would that be okay with you?
> > Thanks,
> > Kyrill
> >
> >>
> >>>
> >>> Though I also wonder how much of this can/should be done on the gimple
> >>> level in a generic way.
> >>
> >> GIMPLE does have powerful ranger infrastructure for this, but I was
> >> concerned about doing this earlier because it's very likely that some
> >> later pass could introduce extra extend operations, which would likely
> >> undo the benefit of the narrowing.
> >>
> >> Thanks,
> >> Kyrill
> >>
> >>> And if there is a way to get the zero-bits from the gimple level down
> >>> to the RTL level still, so we don't need to keep on recomputing them
> >>> (this is useful for other passes too).
> >>>
> >>> Thanks,
> >>> Andrew Pinski
> >>>
> >>>>
> >>>> Also CC'ing Alex Coplan to this thread.
> >>>>
> >>>> Thanks,
> >>>> Soumya
> >>>>
> >>>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
> >>>>>
> >>>>> Hi Tamar,
> >>>>>
> >>>>> Attaching an updated version of this patch that enables the pass at
> >>>>> O2 and above on aarch64; it can be optionally disabled with
> >>>>> -mno-narrow-gp-writes.
> >>>>>
> >>>>> Enabling it by default at O2 touched quite a large number of tests,
> >>>>> which I have updated in this patch.
> >>>>>
> >>>>> Most of the updates are straightforward and involve changing x
> >>>>> registers to (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
> >>>>>
> >>>>> There are some tests (e.g. aarch64/int_mov_immediate_1.c) where the
> >>>>> representation of the immediate changes:
> >>>>>
> >>>>>   mov w0, 4294927974 -> mov w0, -39322
> >>>>>
> >>>>> This happens when the following RTL is narrowed to SI:
> >>>>>
> >>>>>   (set (reg/i:DI 0 x0)
> >>>>>        (const_int 4294927974 [0xffff6666]))
> >>>>>
> >>>>> Because the most significant bit is now bit 31, which is set, the
> >>>>> immediate is printed as a signed value.
> >>>>>
> >>>>> Thanks,
> >>>>> Soumya
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
> >>>>>>
> >>>>>> External email: Use caution opening links or attachments
> >>>>>>
> >>>>>>
> >>>>>> Ping.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Soumya
> >>>>>>
> >>>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
> >>>>>>>
> >>>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
> >>>>>>>
> >>>>>>> This patch adds a new AArch64 RTL pass that optimizes 64-bit
> >>>>>>> general-purpose register operations to use 32-bit W-registers when
> >>>>>>> the upper 32 bits of the register are known to be zero.
> >>>>>>>
> >>>>>>> This is beneficial for the Olympus core, which prefers 32-bit
> >>>>>>> W-registers over 64-bit X-registers where possible.  This is
> >>>>>>> recommended by the updated Olympus Software Optimization Guide,
> >>>>>>> which will be published soon.
> >>>>>>>
> >>>>>>> The pass can be controlled with -mnarrow-gp-writes and runs at -O2
> >>>>>>> and above.  It is not enabled by default, except for -mcpu=olympus.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W
> >>>>>>> register that maps to its lower half.  When we can guarantee that
> >>>>>>> the upper 32 bits are never used, we can safely narrow operations
> >>>>>>> to use W registers instead.
> >>>>>>>
> >>>>>>> For example, this code:
> >>>>>>>
> >>>>>>>   uint64_t foo(uint64_t a) {
> >>>>>>>     return (a & 255) + 3;
> >>>>>>>   }
> >>>>>>>
> >>>>>>> currently compiles to:
> >>>>>>>
> >>>>>>>   and x8, x0, #0xff
> >>>>>>>   add x0, x8, #3
> >>>>>>>
> >>>>>>> but with this pass enabled, it optimizes to:
> >>>>>>>
> >>>>>>>   and x8, x0, #0xff
> >>>>>>>   add w0, w8, #3   // Using a W register instead of an X register
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> The pass operates in two phases:
> >>>>>>>
> >>>>>>> 1) Analysis phase:
> >>>>>>>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
> >>>>>>>    - Computes nonzero-bit masks for each register definition
> >>>>>>>    - Recursively processes PHI nodes
> >>>>>>>    - Identifies candidates for narrowing
> >>>>>>> 2) Transformation phase:
> >>>>>>>    - Applies narrowing to validated candidates
> >>>>>>>    - Converts DImode operations to SImode where safe
> >>>>>>>
> >>>>>>> The pass runs late in the RTL pipeline, after register allocation,
> >>>>>>> to ensure stable def-use chains and avoid interfering with earlier
> >>>>>>> optimizations.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> nonzero_bits (src, DImode) is a function defined in rtlanal.cc that
> >>>>>>> recursively analyzes RTL expressions to compute a bitmask.  However,
> >>>>>>> nonzero_bits has a limitation: when it encounters a register, it
> >>>>>>> conservatively returns the mode mask (all bits potentially set).
> >>>>>>> Since this pass analyzes all defs in an instruction, that
> >>>>>>> information can be used to refine the mask.  The pass maintains a
> >>>>>>> hash map of computed bit masks and installs a custom RTL hooks
> >>>>>>> callback to consult this map when a register is encountered.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> PHI nodes require special handling to merge masks from all inputs.
> >>>>>>> This is done by combine_mask_from_phi, which handles three cases:
> >>>>>>> 1. The input edge has a definition: this is the simplest case.  For
> >>>>>>>    each input edge to the PHI, the def information is retrieved and
> >>>>>>>    its mask is looked up.
> >>>>>>> 2. The input edge has no definition: a conservative mask is assumed
> >>>>>>>    for that input.
> >>>>>>> 3. The input edge is a PHI: recursively call combine_mask_from_phi
> >>>>>>>    to merge the masks of all incoming values.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> When processing regular instructions, the pass first handles SET
> >>>>>>> patterns and PARALLEL patterns containing compare instructions.
> >>>>>>>
> >>>>>>> Single SET instructions:
> >>>>>>>
> >>>>>>> If the upper 32 bits of the source are known to be zero, the
> >>>>>>> instruction qualifies for narrowing.  Instead of just using
> >>>>>>> lowpart_subreg for the source, we define narrow_dimode_src to
> >>>>>>> attempt further optimizations:
> >>>>>>>
> >>>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via
> >>>>>>>   simplify_gen_binary
> >>>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
> >>>>>>>
> >>>>>>> PARALLEL instructions (COMPARE + SET):
> >>>>>>>
> >>>>>>> The pass handles flag-setting operations (ADDS, SUBS, ANDS, etc.)
> >>>>>>> where the SET source equals the first operand of the COMPARE.
> >>>>>>> Depending on the condition-code mode of the compare, the pass
> >>>>>>> requires the following bits to be zero:
> >>>>>>>
> >>>>>>> - CC_Zmode/CC_NZmode: upper 32 bits
> >>>>>>> - CC_NZVmode: upper 32 bits and bit 31 (for overflow)
> >>>>>>>
> >>>>>>> If the instruction does not match the above patterns (or matches
> >>>>>>> but cannot be optimized), the pass still analyzes all of its
> >>>>>>> definitions, so that every definition has an entry in nzero_map.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> When transforming the qualifying instructions, the pass uses
> >>>>>>> rtl_ssa::recog and rtl_ssa::change_is_worthwhile to verify the new
> >>>>>>> pattern and determine whether the transformation is worthwhile.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> As an additional benefit, testing on Neoverse-V2 shows that
> >>>>>>> instances of 'and x1, x2, #0xffffffff' are converted to
> >>>>>>> zero-latency 'mov w1, w2' instructions after this pass narrows
> >>>>>>> them.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu with
> >>>>>>> no regressions.
> >>>>>>> OK for mainline?
> >>>>>>>
> >>>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
> >>>>>>> Signed-off-by: Soumya AR <[email protected]>
> >>>>>>>
> >>>>>>> gcc/ChangeLog:
> >>>>>>>
> >>>>>>> * config.gcc: Add aarch64-narrow-gp-writes.o.
> >>>>>>> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
> >>>>>>> pass_narrow_gp_writes before pass_cleanup_barriers.
> >>>>>>> * config/aarch64/aarch64-tuning-flags.def
> >>>>>>> (AARCH64_EXTRA_TUNING_OPTION):
> >>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
> >>>>>>> * config/aarch64/tuning_models/olympus.h:
> >>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
> >>>>>>> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes):
> >>>>>>> Declare.
> >>>>>>> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
> >>>>>>> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
> >>>>>>> * doc/invoke.texi: Document -mnarrow-gp-writes.
> >>>>>>> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
> >>>>>>>
> >>>>>>> gcc/testsuite/ChangeLog:
> >>>>>>>
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
> >>>>>>> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
> >>>>>>>
> >>>>>>>
> >>>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
>
>
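
To make the cases described in the quoted patch concrete, here is a small,
hypothetical C sketch.  These functions are not taken from the patch or its
testsuite, the names are invented, and the expected effects in the comments
are inferred from the description and examples quoted above rather than from
verified compiler output.

  #include <stdint.h>

  /* Flag-setting case: after combine, the add and the comparison against
     zero plausibly form a PARALLEL of a COMPARE and a SET (an ADDS).  Both
     operands fit in 16 bits, so the upper 32 bits of the sum are known to
     be zero, and the description above suggests the W-register form of the
     flag-setting add can then be used.  */
  uint64_t
  narrow_flags (uint64_t a, uint64_t b)
  {
    uint64_t t = (a & 0xffff) + (b & 0xffff);
    return t == 0 ? 1 : t;
  }

  /* Immediate-printing case from the int_mov_immediate_1.c discussion:
     once a DImode set of (const_int 0xffff6666) is narrowed to SImode,
     bit 31 of the constant becomes the sign bit, so the immediate is
     printed as -39322 instead of 4294927974.  */
  uint64_t
  narrow_imm (void)
  {
    return 0xffff6666;
  }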
