[Bug target/119702] PPCLE: Inefficient auto-vectorization for 64-bit shifts on Power9

jskumari at gcc dot gnu.org via Gcc-bugs Tue, 05 Aug 2025 03:23:45 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119702


--- Comment #16 from Surya Kumari Jangala <jskumari at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #13)
> (In reply to Surya Kumari Jangala from comment #12)
> > Ok. We also need to tackle the original issue, which is that a shift left
> > can be optimized by generating a vector add. Perhaps tackle this issue 
> > first?
> 
> Yup.
> 
> Most of it is handled by generic things already: shifts are always better
> than mults.  And in most cases additions are faster than shifts (or you can
> do more of them concurrently or similar), so in many cases they are
> preferred, but that is not so super obvious already.  You might be able to
> do four adds concurrently, but you might be able to do two shifts
> concurrently additionally, so it all depends on what other code there is
> what works best there, and for what works best usually you have to look at
> the usual instruction mix.
> 
> There is no canonical representation of this in RTL, either: both x+x and
> x<<1 are fine.
> 
> So, if we really care, we should have patterns for both representations in
> our backend, and generate whatever is the best code (for the selected CPU!)
> 
> In practice both are acceptable, so as long as we get the best code in the
> common cases, we should be happy already :-)

With the testcase in the "Description", we currently generate both a splat
and a shift. A single vector add instruction would be more efficient.
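
To illustrate the equivalence, here is a minimal sketch. It is not the
testcase from the Description (which is not reproduced in this comment); the
function names shl1 and add_self are made up for illustration. Both loops
compute the same result for unsigned 64-bit elements, so when vectorized the
first one could in principle be emitted as an add of the vector to itself,
with no need to splat the shift count:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: shift every 64-bit element left by one.
   When vectorized, the shift count has to be available in a vector
   register, which is where a splat can come from. */
void shl1(uint64_t *restrict dst, const uint64_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] << 1;
}

/* Same computation written as x + x.  Unsigned arithmetic wraps
   modulo 2^64, so this matches the shift for every input value. */
void add_self(uint64_t *restrict dst, const uint64_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + src[i];
}
```

Comparing the assembly for the two functions (for example with
-O3 -mcpu=power9) should show whether the shift form still needs the extra
splat.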
