> > It might be reasonable to tweak the costs per CPU; however, I haven't
> > done that.
> >
> > BTW for rotate the wins are much higher because there are no native
> > instructions for it.
> For ashl/lshr, the original implementation only takes 2
> instructions (vpsllw/vpsrlw + vpand), and for ashr when shift count is

But it uses two registers. Shorter dependency chains are usually better.
Also, this case is faster in my microbenchmarks using GFNI, although not
by much (1-4%).

> 7, it only takes 1 instruction (vpcmpgtb), i.e.
> https://godbolt.org/z/Wef97YqGx

Makes sense.
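
For reference, here is a rough scalar sketch of the two non-GFNI sequences
quoted above. It is only a model and not the actual expander code: a
uint16_t stands in for one 16-bit lane holding two bytes, and the names
shl_bytes, shr_bytes, sar7_byte and self_check are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vpsllw + vpand: shift the 16-bit lane left, then mask off
   the bits that crossed from the low byte into the high byte.
   Assumes 0 <= n <= 7. */
static uint16_t shl_bytes(uint16_t lane, unsigned n)
{
    uint16_t mask = (uint16_t)(((0xFFu << n) & 0xFFu) * 0x0101u);
    return (uint16_t)((uint16_t)(lane << n) & mask);
}

/* Model of vpsrlw + vpand: shift the lane right, then mask off the
   bits that crossed from the high byte into the low byte. */
static uint16_t shr_bytes(uint16_t lane, unsigned n)
{
    uint16_t mask = (uint16_t)((0xFFu >> n) * 0x0101u);
    return (uint16_t)((lane >> n) & mask);
}

/* Model of vpcmpgtb(0, x): an arithmetic shift right by 7 yields all
   ones for negative bytes and zero otherwise. */
static uint8_t sar7_byte(int8_t x)
{
    return (0 > x) ? 0xFF : 0x00;
}

/* Exhaustively compare the lane-based shifts with per-byte shifts. */
static int self_check(void)
{
    for (unsigned n = 0; n < 8; n++)
        for (unsigned hi = 0; hi < 256; hi++)
            for (unsigned lo = 0; lo < 256; lo++) {
                uint16_t lane = (uint16_t)((hi << 8) | lo);
                uint16_t l = (uint16_t)((((hi << n) & 0xFFu) << 8)
                                        | ((lo << n) & 0xFFu));
                uint16_t r = (uint16_t)(((hi >> n) << 8) | (lo >> n));
                assert(shl_bytes(lane, n) == l);
                assert(shr_bytes(lane, n) == r);
            }
    return 1;
}
```

The mask is the second input that the shift sequences need, which is
where the extra register comes from.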

-Andi
