On 25/02/2026 11:16, Richard Sandiford wrote:
> Soumya AR <[email protected]> writes:
> > [...]
> > In AArch64, each 64-bit X register has a corresponding 32-bit W
> > register that maps to its lower half.  When we can guarantee that
> > the upper 32 bits are never used, we can safely narrow operations
> > to use W registers instead.
> >
> > For example, this code:
> >
> > uint64_t foo (uint64_t a) {
> >   return (a & 255) + 3;
> > }
> >
> > Currently compiles to:
> >
> >   and x0, x0, 255
> >   add x0, x0, 3
> >   ret
> >
> > But with this pass enabled, it optimizes to:
> >
> >   and w0, w0, 255
> >   add w0, w0, 3
> >   ret
> >
> > ----
> >
> > The pass operates in two phases:
> >
> > 1) Analysis Phase:
> >    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
> >    - Computes nonzero bit masks for each register definition
> >    - Recursively processes PHI nodes
> >    - Identifies candidates for narrowing
> >
> > 2) Transformation Phase:
> >    - Applies narrowing to validated candidates
> >    - Converts DImode operations to SImode where safe
> >
> > The pass runs late in the RTL pipeline, after register allocation,
> > to ensure stable def-use chains and avoid interfering with earlier
> > optimizations.
>
> I haven't looked at the implementation in detail yet, but on the
> design:
>
> As you say above, the pass makes a single pass through the
> instructions, making pessimistic assumptions about backedges.  Did
> you consider instead using a worklist algorithm that makes optimistic
> assumptions and then corrects them?  That would cope better with
> loops.
Having looked at the implementation in a bit of detail, I would tend to
agree.  As written, the pass seems to be a halfway house between a local
analysis and a global analysis (trying to do some of the work of the
latter without really getting the benefits).

It tries to follow backedges with the phi-chasing in
combine_mask_from_phi, but ISTM this phi-chasing is mostly redundant: if
we follow a backedge to an RPO-later phi, then it is likely that we
haven't yet processed all the inputs to that phi anyway (given that we
do a single pass over the RPO), so we seem unlikely to successfully
narrow the masks for loop phis in this way.  Moreover, the recursive
phi-chasing necessitates the use of the visited bitmap, which leads to
quadratic behaviour with the call to bitmap_clear in the main loop.

So IMO the options are:

1. Simplify the pass to avoid the recursive phi-chasing altogether,
   accepting that we won't propagate narrowing information across
   backedges.  This should still allow narrowing of phi masks at join
   points, but not for cycles.  I think this should still handle most
   of the opportunities handled by the current pass (while being more
   compile-time efficient).

2. Follow Richard's suggestion above and extend the pass into a full
   iterative global analysis, which actually tries to propagate
   information across backedges.

Either seems fine from my POV (although the latter should mean more
opportunities can be handled).

Sorry it's taken me so long to get back to you on this.  I do still
want to post a full review of the parts I've looked at so far, although
that may be less useful if we decide to change direction on the design.

Thanks,
Alex

> With that arrangement, the worklist/analysis phase would not make
> optimisations on the fly.  It would simply record a mask for each
> definition (e.g. in a map), making optimistic assumptions about
> definitions that have not been processed yet.
> If the mask calculated for a definition invalidates an earlier
> assumption, the definition would be pushed onto the worklist so that
> all its uses could be reevaluated.  (See gimple-ssa-backprop for
> another pass that works like this, although I'm sure there are
> simpler examples.)
>
> That analysis framework seems generic rather than target-specific.
> Perhaps it should be separated from the pass and provided as an
> independent routine, so that multiple passes can use it.  (I'd
> wondered at one point whether late-combine should do the same kind of
> analysis, but never got around to trying it.)
>
> Thanks,
> Richard
