https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480
--- Comment #11 from Steven Munroe <munroesj at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #9)
> Both vsl and vslo actually look only at the right-most byte in the shift
> amount argument (bits 125..127 resp. bits 121..124). In original AltiVec it
> was required to hold the same value in every lane, but that isn't true since
> arch 2.03 anymore. (See the PEM for the original stuff!)
>
> The current rs6000 stuff will work correctly for ancient CPUs as well.
>
> And the 970 actually has exactly the oldest VMX implementation!
>
> So I wonder what we actually require there (and then if we want to change
> things!)

For this bug the issue is generating an immediate shift count (7-bit) for quadword shifts. It turns out to be hard to generate any small integer constant into a vector __int128, and any load of a constant from .rodata is expensive (9-11 cycles on power8/9, as generated by GCC up to and including GCC 14).

As you point out, the instructions vslo/vsro/vsl/vsr only care about bits 121..127. Also, older machines needed the byte splat for vsl/vsr. So for P9 a single xxspltib should suffice for a vector quadword shift left/right by a constant. How should the average library developer write code to get that (optimal) result?

Also, the intrinsic reference for vec_sl/vec_sr/vec_rl requires the shift count to be the same element size as the shiftee. This shows up as bad code for doubleword shifts on P8/P9/P10 and quadword shifts on P10. It would be nice if these intrinsics also accepted vector char as the shift count for all shiftee types.