Soumya AR <[email protected]> writes:
> [...]
> In AArch64, each 64-bit X register has a corresponding 32-bit W register
> that maps to its lower half. When we can guarantee that the upper 32 bits
> are never used, we can safely narrow operations to use W registers instead.
>
> For example, this code:
> uint64_t foo (uint64_t a) {
>   return (a & 255) + 3;
> }
>
> Currently compiles to:
>
>   and x0, x0, 255
>   add x0, x0, 3
>   ret
>
> But with this pass enabled, it optimizes to:
>
>   and w0, w0, 255
>   add w0, w0, 3
>   ret
>
> ----
>
> The pass operates in two phases:
>
> 1) Analysis Phase:
>   - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>   - Computes nonzero bit masks for each register definition
>   - Recursively processes PHI nodes
>   - Identifies candidates for narrowing
> 2) Transformation Phase:
>   - Applies narrowing to validated candidates
>   - Converts DImode operations to SImode where safe
>
> The pass runs late in the RTL pipeline, after register allocation, to ensure
> stable def-use chains and avoid interfering with earlier optimizations.

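To make the nonzero-bits reasoning in the quoted example concrete, here is a minimal standalone C++ sketch. It is not code from the patch: the helper names (`mask_width`, `nonzero_and`, `nonzero_plus`, `fits_in_w_register`) are hypothetical, and the rule for addition is a deliberately coarse upper bound. The sketch assumes that a value is a narrowing candidate when none of its possibly-nonzero bits lie above bit 31, which is sound on AArch64 because a write to a W register zeroes the upper 32 bits of the corresponding X register.

```cpp
#include <algorithm>
#include <cstdint>

// Number of significant bits in MASK (0 for an all-zero mask).
int mask_width(uint64_t mask) {
    int width = 0;
    while (mask != 0) {
        ++width;
        mask >>= 1;
    }
    return width;
}

// Bits that may be nonzero in (x & y): a bit can only survive if it
// may be nonzero in both operands.
uint64_t nonzero_and(uint64_t nz_x, uint64_t nz_y) {
    return nz_x & nz_y;
}

// Bits that may be nonzero in (x + y).  Coarse bound (an assumption of
// this sketch): the sum can carry into at most one bit above the wider
// operand, so every bit below that position may be set.
uint64_t nonzero_plus(uint64_t nz_x, uint64_t nz_y) {
    int width = std::max(mask_width(nz_x), mask_width(nz_y)) + 1;
    return width >= 64 ? ~uint64_t{0} : (uint64_t{1} << width) - 1;
}

// Hypothetical candidate test: no possibly-nonzero bit above bit 31.
bool fits_in_w_register(uint64_t nz) {
    return (nz >> 32) == 0;
}
```

For `foo` above, the argument `a` starts with an all-ones mask, `a & 255` has mask `0xff`, and adding `3` gives mask `0x1ff`, so both operations pass the candidate test under these assumptions.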
I haven't looked at the implementation in detail yet, but on the design:

As you say above, the pass makes a single pass through the instructions,
making pessimistic assumptions about backedges. Did you consider instead
using a worklist algorithm that makes optimistic assumptions and then
corrects them? That would cope better with loops.

With that arrangement, the worklist/analysis phase would not make
optimisations on the fly. It would simply record a mask for each
definition (e.g. in a map), making optimistic assumptions about
definitions that have not been processed yet. If the mask calculated
for a definition invalidates an earlier assumption, the definition would
be pushed onto the worklist so that all its uses could be reevaluated.
(See gimple-ssa-backprop for another pass that works like this, although
I'm sure there are simpler examples.)

That analysis framework seems generic rather than target-specific.
Perhaps it should be separated from the pass and provided as an
independent routine, so that multiple passes can use it. (I'd wondered
at one point whether late-combine should do the same kind of analysis,
but never got around to trying it.)

Thanks,
Richard
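For illustration, the optimistic worklist scheme described above might be sketched as follows. This is a toy model in standalone C++, not RTL-SSA or GCC code: the `Def`, `Op`, `transfer`, `compute_masks` and `example_loop` names are all hypothetical, definitions are plain vector indices, and the addition rule is a coarse bound. Every mask starts at zero (the optimistic assumption), and whenever a definition's mask grows, its uses are queued for reevaluation. As a small simplification, the sketch queues the uses directly rather than the changed definition.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

enum class Op { Unknown, Const, And, Plus, Phi };

struct Def {
    Op op;
    uint64_t value;             // used by Const only
    std::vector<int> operands;  // indices of the definitions this one reads
};

// Number of significant bits in MASK (0 for an all-zero mask).
int mask_width(uint64_t mask) {
    int width = 0;
    for (; mask != 0; mask >>= 1)
        ++width;
    return width;
}

// Possibly-nonzero bits of DEF, given the current masks of its operands.
uint64_t transfer(const Def &def, const std::vector<uint64_t> &mask) {
    switch (def.op) {
    case Op::Const:
        return def.value;
    case Op::And:
        return mask[def.operands[0]] & mask[def.operands[1]];
    case Op::Plus: {
        // Coarse bound: one carry bit above the wider operand.
        int width = std::max(mask_width(mask[def.operands[0]]),
                             mask_width(mask[def.operands[1]])) + 1;
        return width >= 64 ? ~uint64_t{0} : (uint64_t{1} << width) - 1;
    }
    case Op::Phi: {
        uint64_t result = 0;
        for (int input : def.operands)
            result |= mask[input];
        return result;
    }
    case Op::Unknown:
        break;
    }
    return ~uint64_t{0};  // nothing known: every bit may be nonzero
}

std::vector<uint64_t> compute_masks(const std::vector<Def> &defs) {
    const int count = static_cast<int>(defs.size());

    // Reverse edges: uses[d] lists the definitions that read d.
    std::vector<std::vector<int>> uses(defs.size());
    for (int d = 0; d < count; ++d)
        for (int operand : defs[d].operands)
            uses[operand].push_back(d);

    // Optimistic start: assume no bit of any definition is nonzero.
    std::vector<uint64_t> mask(defs.size(), 0);
    std::deque<int> worklist;
    for (int d = 0; d < count; ++d)
        worklist.push_back(d);

    while (!worklist.empty()) {
        int d = worklist.front();
        worklist.pop_front();
        // Masks only ever grow, so the iteration terminates.
        uint64_t updated = mask[d] | transfer(defs[d], mask);
        if (updated == mask[d])
            continue;
        mask[d] = updated;  // an earlier assumption was too optimistic
        for (int use : uses[d])
            worklist.push_back(use);
    }
    return mask;
}

// Toy loop: i = phi(0, next); sum = i + 1; next = sum & 255.
std::vector<Def> example_loop() {
    return {
        {Op::Const, 0, {}},      // 0: constant 0
        {Op::Const, 1, {}},      // 1: constant 1
        {Op::Const, 255, {}},    // 2: constant 255
        {Op::Phi, 0, {0, 5}},    // 3: i, with a backedge from def 5
        {Op::Plus, 0, {3, 1}},   // 4: sum
        {Op::And, 0, {4, 2}},    // 5: next
    };
}
```

On the toy loop, `compute_masks(example_loop())` settles on mask `0xff` for the PHI, because the backedge value is only ever merged in once it has actually been computed. A single forward pass that treated the backedge input as unknown would have to give the PHI an all-ones mask instead.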
