https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480

--- Comment #11 from Steven Munroe <munroesj at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #9)
> Both vsl and vslo actually look only at the right-most byte in the shift
> amount argument (bits 125..127 resp. bits 121..124).  In original AltiVec it
> was required to hold the same value in every lane, but that isn't true since
> arch 2.03 anymore.  (See the PEM for the original stuff!)
> 
> The current rs6000 stuff will work correctly for ancient CPUs as well.
> 
> And the 970 actually has exactly the oldest VMX implementation!
> 
> So I wonder what we actually require there (and then if we want to change
> things!)

For this bug the issue is generating an immediate shift count (7-bit) for
quadword shifts.

It turns out to be hard to generate any small integer constant into a vector
__int128, and any load of a constant from .rodata is expensive (9-11 cycles on
power8/9, as generated by GCC up to and including GCC 14).
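
A minimal illustration (the function name is mine) of what a library writer
runs into; per the cost noted above, GCC through 14 typically satisfies this
with a .rodata load rather than building the constant in registers:

#include <altivec.h>

vector unsigned __int128
const_7 (void)
{
  /* A small quadword constant.  Through GCC 14 this is normally
     materialized as a load from .rodata (the 9-11 cycle cost above),
     not synthesized in vector registers.  */
  return (vector unsigned __int128) { 7 };
}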

And as you point out, the instructions vslo/vsro/vsl/vsr only care about bits
121..127. Also, older machines needed the byte splat for vsl/vsr.

So for P9 a single xxspltib should suffice for a vector quadword shift
left/right by a constant. How should the average library developer write code
to get that (optimal) result? One possibility is sketched below.
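
A sketch (names are mine) that combines vec_slo and vec_sll with a single
splatted byte, in the hope that vec_splats of a constant byte compiles to one
xxspltib on P9:

#include <altivec.h>

/* Shift a quadword left by a constant 0..127 bits.  One splatted byte
   serves both instructions: vslo reads bits 121:124 (the octet count,
   shb / 8) and vsl reads bits 125:127 (the bit count, shb % 8).  The
   splat also satisfies the original VMX requirement that every byte of
   the vsl shift count hold the same value.  */
static inline vector unsigned char
vec_slqi (vector unsigned char vra, const unsigned int shb)
{
  vector unsigned char vsht = vec_splats ((unsigned char) shb);
  vector unsigned char result;

  result = vec_slo (vra, vsht);    /* left by whole octets  */
  result = vec_sll (result, vsht); /* then by residual bits */
  return result;
}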

Also, the intrinsic reference for vec_sl/vec_sr/vec_rl requires the shift
count to be the same element size as the shiftee. This shows up as bad code
for doubleword shifts on P8/9/10 and quadword shifts on P10 (see the example
below). It would be nice if these intrinsics also accepted vector char as the
shift count for all shiftee types.
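
For example, the doubleword case as the reference requires it today (a
sketch; the function name is mine):

#include <altivec.h>

vector unsigned long long
sldi_3 (vector unsigned long long x)
{
  /* The count must be a doubleword splat to match the shiftee, even
     though vsld only uses the low 6 bits of each element.  This is the
     splat that ends up as a .rodata load on P8/P9; a vector char count
     could instead be a single xxspltib on P9.  */
  vector unsigned long long vsht = vec_splats ((unsigned long long) 3);
  return vec_sl (x, vsht);
}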
