Ping.

Thanks,
Soumya
> On 20 Jan 2026, at 3:59 PM, Kyrylo Tkachov <[email protected]> wrote:
>
>> On 20 Jan 2026, at 10:06, Kyrylo Tkachov <[email protected]> wrote:
>>
>>> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]> wrote:
>>>
>>> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
>>>>
>>>> Ping.
>>>>
>>>> I split the files from the previous mail so it's hopefully easier
>>>> to review.
>>>
>>> I can review this, but approval will have to wait until stage1. This
>>> pass is too risky at this point of the release cycle.
>>
>> Thanks for any feedback you can give. FWIW we've been testing this
>> internally for a few months without any issues.
>
> One option to reduce the risk, which Soumya's initial patch
> implemented, is to enable this only for -mcpu=olympus. We initially
> developed and tested it on that target. That way it wouldn't affect
> most aarch64 targets, and we'd still have the -mno-* option for users
> to disable it as a workaround if it causes trouble.
> Would that be okay with you?
> Thanks,
> Kyrill
>
>>> Though I also wonder how much of this can/should be done on the
>>> gimple level in a generic way.
>>
>> GIMPLE does have powerful ranger infrastructure for this, but I was
>> concerned about doing this earlier because it's very likely that some
>> later pass could introduce extra extend operations, which would
>> likely undo the benefit of the narrowing.
>>
>> Thanks,
>> Kyrill
>>
>>> And if there is a way to get the zero-bits from the gimple level
>>> down to the RTL level so we don't need to keep recomputing them
>>> (this is useful for other passes too).
>>>
>>> Thanks,
>>> Andrew Pinski
>>>
>>>> Also CC'ing Alex Coplan on this thread.
>>>>
>>>> Thanks,
>>>> Soumya
>>>>
>>>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
>>>>>
>>>>> Hi Tamar,
>>>>>
>>>>> Attaching an updated version of this patch that enables the pass
>>>>> at -O2 and above on aarch64; it can be disabled with
>>>>> -mno-narrow-gp-writes.
>>>>>
>>>>> Enabling it by default at -O2 touched quite a large number of
>>>>> tests, which I have updated in this patch.
>>>>>
>>>>> Most of the updates are straightforward: they relax x-register
>>>>> matches to accept w or x registers (e.g., x[0-9]+ -> [wx][0-9]+).
>>>>>
>>>>> There are some tests (e.g. aarch64/int_mov_immediate_1.c) where
>>>>> the representation of the immediate changes:
>>>>>
>>>>> mov w0, 4294927974 -> mov w0, -39322
>>>>>
>>>>> This happens when the following RTL is narrowed to SImode:
>>>>> (set (reg/i:DI 0 x0)
>>>>>      (const_int 4294927974 [0xffff6666]))
>>>>>
>>>>> Because the MSB is now bit 31, which is set, the immediate is
>>>>> printed as a signed value.
>>>>>
>>>>> Thanks,
>>>>> Soumya
>>>>>
>>>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
>>>>>>
>>>>>> Ping.
>>>>>>
>>>>>> Thanks,
>>>>>> Soumya
>>>>>>
>>>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
>>>>>>>
>>>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
>>>>>>>
>>>>>>> This patch adds a new AArch64 RTL pass that narrows 64-bit
>>>>>>> general-purpose register operations to 32-bit W-register
>>>>>>> operations when the upper 32 bits of the register are known to
>>>>>>> be zero.
>>>>>>>
>>>>>>> This benefits the Olympus core, which prefers 32-bit W-registers
>>>>>>> over 64-bit X-registers where possible. This is recommended by
>>>>>>> the updated Olympus Software Optimization Guide, which will be
>>>>>>> published soon.
>>>>>>>
>>>>>>> The pass is controlled by -mnarrow-gp-writes and is active at
>>>>>>> -O2 and above, but it is not enabled by default except for
>>>>>>> -mcpu=olympus.
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W
>>>>>>> register that maps to its lower half. When we can guarantee that
>>>>>>> the upper 32 bits are never used, we can safely narrow
>>>>>>> operations to use W registers instead.
>>>>>>>
>>>>>>> For example, this code:
>>>>>>>
>>>>>>> uint64_t foo(uint64_t a) {
>>>>>>>   return (a & 255) + 3;
>>>>>>> }
>>>>>>>
>>>>>>> currently compiles to:
>>>>>>>
>>>>>>> and x8, x0, #0xff
>>>>>>> add x0, x8, #3
>>>>>>>
>>>>>>> but with this pass enabled, it optimizes to:
>>>>>>>
>>>>>>> and x8, x0, #0xff
>>>>>>> add w0, w8, #3 // Using W register instead of X
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> The pass operates in two phases:
>>>>>>>
>>>>>>> 1) Analysis phase:
>>>>>>>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>>>>>>>    - Computes nonzero-bit masks for each register definition
>>>>>>>    - Recursively processes PHI nodes
>>>>>>>    - Identifies candidates for narrowing
>>>>>>> 2) Transformation phase:
>>>>>>>    - Applies narrowing to validated candidates
>>>>>>>    - Converts DImode operations to SImode where safe
>>>>>>>
>>>>>>> The pass runs late in the RTL pipeline, after register
>>>>>>> allocation, to ensure stable def-use chains and avoid
>>>>>>> interfering with earlier optimizations.
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> nonzero_bits (src, DImode) is a function defined in rtlanal.cc
>>>>>>> that recursively analyzes RTL expressions to compute a bitmask.
>>>>>>> However, nonzero_bits has a limitation: when it encounters a
>>>>>>> register, it conservatively returns the mode mask (all bits
>>>>>>> potentially set). Since this pass analyzes all defs in an
>>>>>>> instruction, that information can be used to refine the mask.
>>>>>>> The pass maintains a hash map of computed bit masks and installs
>>>>>>> a custom RTL hooks callback to consult this map when a register
>>>>>>> is encountered.
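>>>>>>>
>>>>>>> To illustrate the idea, here is a simplified sketch (not the
>>>>>>> code in the attached patch, and all names below are made up):
>>>>>>> rtl.h exposes a global rtl_hooks structure that nonzero_bits
>>>>>>> consults for registers, and combine.cc overrides it in the same
>>>>>>> way. The sketch assumes the hook signature used by combine.cc's
>>>>>>> reg_nonzero_bits_for_combine (see rtl.h for the exact prototype)
>>>>>>> and uses a plain per-hard-register table in place of the pass's
>>>>>>> hash map:
>>>>>>>
>>>>>>> /* Known nonzero-bit masks, indexed by hard register number and
>>>>>>>    reset to all-ones (no knowledge) before each function.  The
>>>>>>>    patch keys a hash map by RTL-SSA definition instead.  */
>>>>>>> static unsigned HOST_WIDE_INT reg_masks[FIRST_PSEUDO_REGISTER];
>>>>>>>
>>>>>>> /* Callback for rtl_hooks.reg_nonzero_bits: refine the
>>>>>>>    conservative mode mask with what the analysis has already
>>>>>>>    proven about X.  Returning NULL_RTX tells nonzero_bits there
>>>>>>>    is no further expression to recurse into.  */
>>>>>>> static rtx
>>>>>>> narrow_gp_reg_nonzero_bits (const_rtx x, scalar_int_mode,
>>>>>>>                             scalar_int_mode,
>>>>>>>                             unsigned HOST_WIDE_INT *nonzero)
>>>>>>> {
>>>>>>>   if (REG_P (x) && HARD_REGISTER_P (x))
>>>>>>>     *nonzero &= reg_masks[REGNO (x)];
>>>>>>>   return NULL_RTX;
>>>>>>> }
>>>>>>>
>>>>>>> /* Install the callback for the duration of the analysis;
>>>>>>>    restore with "rtl_hooks = general_rtl_hooks;" afterwards.  */
>>>>>>> static void
>>>>>>> install_narrowing_hooks ()
>>>>>>> {
>>>>>>>   rtl_hooks = general_rtl_hooks;
>>>>>>>   rtl_hooks.reg_nonzero_bits = narrow_gp_reg_nonzero_bits;
>>>>>>> }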
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> PHI nodes require special handling to merge masks from all
>>>>>>> inputs. This is done by combine_mask_from_phi, which handles
>>>>>>> three cases (a sketch of the merge follows the list):
>>>>>>> 1. Input edge has a definition: this is the simplest case. For
>>>>>>>    each input edge to the PHI, the def information is retrieved
>>>>>>>    and its mask is looked up.
>>>>>>> 2. Input edge has no definition: a conservative mask is assumed
>>>>>>>    for that input.
>>>>>>> 3. Input edge is a PHI: combine_mask_from_phi is called
>>>>>>>    recursively to merge the masks of all incoming values.
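>>>>>>>
>>>>>>> Roughly, the merge has the following shape (a simplified sketch
>>>>>>> rather than the attached code: phi_info, set_info, hash_map and
>>>>>>> hash_set are the real GCC/rtl-ssa types, while the argument list
>>>>>>> and the cycle guard are illustrative):
>>>>>>>
>>>>>>> /* Merge the known nonzero-bit masks of all inputs of PHI.
>>>>>>>    VISITED guards against cycles through loop phis; a phi that
>>>>>>>    is already being processed contributes no bits beyond its
>>>>>>>    other inputs.  */
>>>>>>> static unsigned HOST_WIDE_INT
>>>>>>> combine_mask_from_phi (phi_info *phi,
>>>>>>>                        hash_map<def_info *, unsigned HOST_WIDE_INT>
>>>>>>>                          &nzero_map,
>>>>>>>                        hash_set<phi_info *> &visited)
>>>>>>> {
>>>>>>>   if (visited.add (phi))
>>>>>>>     return 0;
>>>>>>>
>>>>>>>   unsigned HOST_WIDE_INT mask = 0;
>>>>>>>   for (unsigned int i = 0; i < phi->num_inputs (); ++i)
>>>>>>>     {
>>>>>>>       set_info *input = phi->input_value (i);
>>>>>>>       if (!input)
>>>>>>>         /* Case 2: no definition on this edge; be conservative.  */
>>>>>>>         mask = HOST_WIDE_INT_M1U;
>>>>>>>       else if (auto *input_phi = dyn_cast<phi_info *> (input))
>>>>>>>         /* Case 3: the input is itself a phi; merge recursively.  */
>>>>>>>         mask |= combine_mask_from_phi (input_phi, nzero_map,
>>>>>>>                                        visited);
>>>>>>>       else if (unsigned HOST_WIDE_INT *m = nzero_map.get (input))
>>>>>>>         /* Case 1: a normal definition with a computed mask.  */
>>>>>>>         mask |= *m;
>>>>>>>       else
>>>>>>>         mask = HOST_WIDE_INT_M1U;
>>>>>>>     }
>>>>>>>   return mask;
>>>>>>> }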
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> When processing regular instructions, the pass first handles SET
>>>>>>> patterns and PARALLEL patterns containing compare instructions.
>>>>>>>
>>>>>>> Single SET instructions:
>>>>>>>
>>>>>>> If the upper 32 bits of the source are known to be zero, the
>>>>>>> instruction qualifies for narrowing. Instead of just applying
>>>>>>> lowpart_subreg to the source, narrow_dimode_src attempts further
>>>>>>> simplifications:
>>>>>>>
>>>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via
>>>>>>>   simplify_gen_binary
>>>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
>>>>>>>
>>>>>>> PARALLEL instructions (compare + SET):
>>>>>>>
>>>>>>> The pass handles flag-setting operations (ADDS, SUBS, ANDS,
>>>>>>> etc.) where the SET source equals the first operand of the
>>>>>>> COMPARE. Depending on the condition-code mode of the compare,
>>>>>>> the pass requires different bits to be zero:
>>>>>>>
>>>>>>> - CC_Zmode/CC_NZmode: the upper 32 bits
>>>>>>> - CC_NZVmode: the upper 32 bits and bit 31 (for overflow)
>>>>>>>
>>>>>>> If an instruction does not match the above patterns (or matches
>>>>>>> but cannot be optimized), the pass still analyzes all of its
>>>>>>> definitions, ensuring every definition has an entry in
>>>>>>> nzero_map.
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> When transforming the qualified instructions, the pass uses
>>>>>>> rtl_ssa::recog and rtl_ssa::change_is_worthwhile to verify the
>>>>>>> new pattern and determine whether the transformation is
>>>>>>> worthwhile.
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> As an additional benefit, testing on Neoverse-V2 shows that
>>>>>>> instances of 'and x1, x2, #0xffffffff' are converted to
>>>>>>> zero-latency 'mov w1, w2' instructions after this pass narrows
>>>>>>> them.
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu
>>>>>>> with no regressions.
>>>>>>> OK for mainline?
>>>>>>>
>>>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
>>>>>>> Signed-off-by: Soumya AR <[email protected]>
>>>>>>>
>>>>>>> gcc/ChangeLog:
>>>>>>>
>>>>>>> * config.gcc: Add aarch64-narrow-gp-writes.o.
>>>>>>> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
>>>>>>>   pass_narrow_gp_writes before pass_cleanup_barriers.
>>>>>>> * config/aarch64/aarch64-tuning-flags.def
>>>>>>>   (AARCH64_EXTRA_TUNING_OPTION): Add
>>>>>>>   AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
>>>>>>> * config/aarch64/tuning_models/olympus.h: Add
>>>>>>>   AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
>>>>>>> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes):
>>>>>>>   Declare.
>>>>>>> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
>>>>>>> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
>>>>>>> * doc/invoke.texi: Document -mnarrow-gp-writes.
>>>>>>> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
>>>>>>>
>>>>>>> gcc/testsuite/ChangeLog:
>>>>>>>
>>>>>>> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
>>>>>>> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
>>>>>>> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
>>>>>>> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
>>>>>>> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
>>>>>>> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
>>>>>>> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
>>>>>>>
>>>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
