> > The latter takes 5 cycles, the former takes 3 cycles. It's pipelined however.
>
> Do you have any microbenchmark or real workloads to show your
> optimization is better?

Keep in mind it only uses one port vs two.

Yes, I ran it on Arrow Lake and saw wins on both the P-core and the
E-core, measured by throughput. In fact the E-core saw higher wins even
though it has higher latency (which was surprising to me).

Another advantage is that it uses fewer registers, so more unrolling is
possible.

It might be reasonable to tweak the costs per CPU, however; I haven't
done that.

BTW, for rotate the wins are much higher because there are no native
instructions for it.

-Andi