On Wed, Aug 13, 2025 at 1:40 AM Andi Kleen <a...@firstfloor.org> wrote:
>
> >
> > The latter takes 5 cycles, the former takes 3 cycles.
>
> It's pipelined however.
>
> >
> > Do you have any microbenchmark or real workloads to show your
> > optimization is better?
>
> Keep in mind it only uses one port vs two.
>
> Yes I ran it on Arrow lake and saw wins on both Pcore and Ecore
> according to the throughputs. In fact Ecore saw higher wins
> even though it has higher latency (it was surprising to me)
>
> Another advantage is using less registers so more unrolling is possible.
>
> It might be reasonable to tweak the costs per CPU however, I haven't
> done that.
>
> BTW for rotate the wins are much higher because there are no native
> instructions for it.

For ashl/lshr, the original implementation only takes 2 instructions
(vpsllw/vpsrlw + vpand), and for ashr when the shift count is 7, it only
takes 1 instruction (vpcmpgtb), i.e.
https://godbolt.org/z/Wef97YqGx
So I'd like to keep the original implementation for them.
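
For illustration, the short sequences for byte shifts can be sketched with intrinsics. This is only a sketch under stated assumptions: it uses the 128-bit SSE2 intrinsics rather than the wider VEX forms named above, the shift count 3 is picked arbitrarily, and the function names (shl3_epi8, lshr3_epi8, ashr7_epi8, shift_demo_selfcheck) are hypothetical and not taken from the patch. It is not the code behind the godbolt link.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Byte-wise shift left by 3: shift 16-bit words, then clear the low
   3 bits of every byte, which may hold bits that crossed over from
   the neighbouring byte (word shift + and). */
static __m128i shl3_epi8(__m128i v)
{
    __m128i shifted = _mm_slli_epi16(v, 3);
    return _mm_and_si128(shifted, _mm_set1_epi8((char)0xF8));
}

/* Byte-wise logical shift right by 3: same idea, but the stray bits
   land in the top 3 bits of every byte. */
static __m128i lshr3_epi8(__m128i v)
{
    __m128i shifted = _mm_srli_epi16(v, 3);
    return _mm_and_si128(shifted, _mm_set1_epi8(0x1F));
}

/* Byte-wise arithmetic shift right by 7: every byte becomes 0xFF if
   it was negative and 0x00 otherwise, which is exactly the result of
   the signed compare 0 > v. */
static __m128i ashr7_epi8(__m128i v)
{
    return _mm_cmpgt_epi8(_mm_setzero_si128(), v);
}

/* Compare all three helpers against scalar code for every byte value.
   Returns 1 on success; an assert fires on any mismatch. */
static int shift_demo_selfcheck(void)
{
    for (int base = 0; base < 256; base += 16) {
        uint8_t in[16], shl[16], shr[16], sar[16];
        for (int i = 0; i < 16; i++)
            in[i] = (uint8_t)(base + i);

        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)shl, shl3_epi8(v));
        _mm_storeu_si128((__m128i *)shr, lshr3_epi8(v));
        _mm_storeu_si128((__m128i *)sar, ashr7_epi8(v));

        for (int i = 0; i < 16; i++) {
            assert(shl[i] == (uint8_t)(in[i] << 3));
            assert(shr[i] == (uint8_t)(in[i] >> 3));
            assert(sar[i] == ((in[i] & 0x80) ? 0xFF : 0x00));
        }
    }
    return 1;
}
```

Calling shift_demo_selfcheck() from a test program checks the helpers against plain scalar shifts. The mask is needed because x86 has no byte-granular vector shift, so the word shift leaks bits across byte boundaries.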
For ashr (w/ shift count != 7) and rotate, I agree with your point.

>
> -Andi
>

--
BR,
Hongtao