Hi Peter, On Mon, Jul 24, 2017 at 05:16:32PM +0200, Peter Zijlstra wrote: > The initial value (@m) compute is: > > m = 1UL << (BITS_PER_LONG - 2); > while (m > x) > m >>= 2; > > Which is a linear search for the highest even bit smaller or equal to @x > We can implement this using a binary search using __fls() (or better > when its hardware implemented). > > m = 1UL << (__fls(x) & ~1UL); > > Especially for small values of @x; which are the more common > arguments; the linear search is near to worst case, while the binary > search of __fls() is a constant 6 branches. > > cycles: branches: branch-misses: > > PRE: > > hot: 43.633557 +- 0.034373 45.333132 +- 0.002277 0.023529 +- 0.000681 > cold: 207.438411 +- 0.125840 45.333132 +- 0.002277 6.976486 +- 0.004219 > > SOFTWARE FLS: > > hot: 29.576176 +- 0.028850 26.666730 +- 0.004511 0.019463 +- 0.000663 > cold: 165.947136 +- 0.188406 26.666746 +- 0.004511 6.133897 +- 0.004386 > > HARDWARE FLS: > > hot: 24.720922 +- 0.025161 20.666784 +- 0.004509 0.020836 +- 0.000677 > cold: 132.777197 +- 0.127471 20.666776 +- 0.004509 5.080285 +- 0.003874 > > Averages computed over all values <128k using a LFSR to generate > order. Cold numbers have a LFSR based branch trace buffer 'confuser' > ran between each int_sqrt() invocation.
The hardware fls version works nicely for arm64, where it can be implemented using the clz instruction (via the __builtin_clzl intrinsic). Acked-by: Will Deacon <[email protected]> Cheers, Will

