On 25/02/2026 11:16, Richard Sandiford wrote:
> Soumya AR <[email protected]> writes:
> > [...]
> > In AArch64, each 64-bit X register has a corresponding 32-bit W
> > register that maps to its lower half.  When we can guarantee that
> > the upper 32 bits are never used, we can safely narrow operations
> > to use W registers instead.
> >
> > For example, this code:
> >
> > uint64_t foo (uint64_t a) {
> >   return (a & 255) + 3;
> > }
> >
> > Currently compiles to:
> >
> >   and x0, x0, 255
> >   add x0, x0, 3
> >   ret
> >
> > But with this pass enabled, it optimizes to:
> >
> >   and w0, w0, 255
> >   add w0, w0, 3
> >   ret
> >
> > ----
> >
> > The pass operates in two phases:
> >
> > 1) Analysis Phase:
> >    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
> >    - Computes nonzero bit masks for each register definition
> >    - Recursively processes PHI nodes
> >    - Identifies candidates for narrowing
> >
> > 2) Transformation Phase:
> >    - Applies narrowing to validated candidates
> >    - Converts DImode operations to SImode where safe
> >
> > The pass runs late in the RTL pipeline, after register allocation,
> > to ensure stable def-use chains and avoid interfering with earlier
> > optimizations.
>
> I haven't looked at the implementation in detail yet, but on the
> design:
>
> As you say above, the pass makes a single pass through the
> instructions, making pessimistic assumptions about backedges.  Did
> you consider instead using a worklist algorithm that makes optimistic
> assumptions and then corrects them?  That would cope better with
> loops.
Having looked at the implementation in a bit of detail, I would tend to
agree.  As written, the pass seems to be a halfway house between a local
analysis and a global analysis (trying to do some of the work of the
latter without really getting the benefits).

It tries to follow backedges with the phi-chasing in
combine_mask_from_phi, but ISTM this phi-chasing is mostly redundant: if
we follow a backedge to an RPO-later phi, then it is likely that we
haven't yet processed all the inputs to that phi anyway (given that we
do a single pass over the RPO), so we seem unlikely to successfully
narrow the masks for loop phis in this way.  Moreover, the recursive
phi-chasing necessitates the use of the visited bitmap, which leads to
quadratic behaviour with the call to bitmap_clear in the main loop.

So IMO the options are:

1. Simplify the pass to avoid the recursive phi-chasing altogether,
   accepting that we won't propagate narrowing information across
   backedges.  This should still allow narrowing of phi masks at join
   points, but not for cycles.  I think this should still handle most
   of the opportunities handled by the current pass (while being more
   compile-time efficient).

2. Follow Richard's suggestion above and extend the pass into a full
   iterative global analysis, which actually tries to propagate
   information across backedges.

Either seems fine from my POV (although the latter should mean more
opportunities can be handled).

Sorry it's taken me so long to get back to you on this.  I do still
want to post a full review of the parts I've looked at so far, although
that may be less useful if we decide to change direction on the design.

Thanks,
Alex

> With that arrangement, the worklist/analysis phase would not make
> optimisations on the fly.  It would simply record a mask for each
> definition (e.g. in a map), making optimistic assumptions about
> definitions that have not been processed yet.
> If the mask calculated for a definition invalidates an earlier
> assumption, the definition would be pushed onto the worklist so that
> all its uses could be reevaluated.  (See gimple-ssa-backprop for
> another pass that works like this, although I'm sure there are
> simpler examples.)
>
> That analysis framework seems generic rather than target-specific.
> Perhaps it should be separated from the pass and provided as an
> independent routine, so that multiple passes can use it.  (I'd
> wondered at one point whether late-combine should do the same kind of
> analysis, but never got around to trying it.)
>
> Thanks,
> Richard
