https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480
--- Comment #11 from Steven Munroe <munroesj at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #9)
> Both vsl and vslo actually look only at the right-most byte in the shift
> amount argument (bits 125..127 resp. bits 121..124). In original AltiVec it
> was required to hold the same value in every lane, but that isn't true since
> arch 2.03 anymore. (See the PEM for the original stuff!)
>
> The current rs6000 stuff will work correctly for ancient CPUs as well.
>
> And the 970 actually has exactly the oldest VMX implementation!
>
> So I wonder what we actually require there (and then if we want to change
> things!)

For this bug the issue is generating an immediate shift count (7-bit) for quadword shifts. It turns out to be hard to generate any small integer constant into a vector __int128, and any load of a constant from .rodata is expensive (9-11 cycles on power8/9, as generated by GCC up to and including GCC 14).

As you point out, the instructions vslo/vsro/vsl/vsr only care about bits 121..127. Also, older machines needed the byte splat for vsl/vsr. So for P9 a single xxspltib should suffice for a vector quadword shift left/right by a constant. How should the average library developer write code to get that (optimal) result?

Also, the intrinsic reference for vec_sl/vec_sr/vec_rl requires the shift count to be the same element size as the shiftee. This shows up as bad code for doubleword shifts on P8/P9/P10 and quadword shifts on P10. It would be nice if these intrinsics also accepted vector char as the shift count for all shiftee types.