On Mon, 21 Apr 2008 15:36:06 +0200, "Gabriel Paubert" <[EMAIL PROTECTED]> said:
> On Mon, Apr 21, 2008 at 03:07:13PM +0200, Alexander van Heukelum wrote:
> > On Mon, 21 Apr 2008 22:13:06 +1000, "Paul Mackerras" <[EMAIL PROTECTED]> said:
> > > Alexander van Heukelum writes:
> > > > Powerpc would pick up an optimized version via this chain:
> > > > generic fls64 -> powerpc __fls --> __ilog2 -->
> > > > asm (PPC_CNTLZL "%0,%1" : "=r" (lz) : "r" (x)).
> > >
> > > Why wouldn't powerpc continue to use the fls64 that I have in there
> > > now?
> >
> > In Linus' tree that would be the generic one that uses (the 32-bit)
> > fls():
> >
> > static inline int fls64(__u64 x)
> > {
> > 	__u32 h = x >> 32;
> > 	if (h)
> > 		return fls(h) + 32;
> > 	return fls(x);
> > }
> >
> > > > However, the generic version of fls64 first tests the argument for
> > > > zero. From your code I derive that the count-leading-zeroes
> > > > instruction for argument zero is defined as
> > > > cntlzl(0) == BITS_PER_LONG.
> > >
> > > That is correct. If the argument is 0 then all of the zero bits are
> > > leading zeroes. :)
> >
> > So... for 64-bit powerpc it makes sense to have its own implementation
> > and ignore the (improved) generic one and for 32-bit powerpc the generic
> > implementation of fls64 is fine. The current situation in linux-next
> > seems optimal to me.
>
> Not so sure, the optimal version of fls64 for 32 bit PPC seems to be:
>
> 	cntlzw ch,h		; ch = fls32(h) where h = x>>32
> 	cntlzw cl,l		; cl = fls32(l) where l = (__u32)x
> 	srwi t1,ch,5
> 	neg t1,t1		; t1 = (h==0) ? -1 : 0
> 	and cl,t1,cl		; cl = (h==0) ? cl : 0
> 	add result,ch,cl
>
> That's only 6 instructions without any branch, although the dependency
> chain is 5 instructions long. Good luck getting the compiler to
> generate something as compact as this.
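
For readers following along, here is a rough C model of the six-instruction sequence quoted above. It is only a sketch, not code from any tree: the names clz32_ppc() and fls64_branchless() are made up for illustration, and clz32_ppc() merely imitates cntlzw (which returns 32 for a zero input) on top of the GCC/Clang builtin __builtin_clz(). Since cntlzw counts leading zeroes, the quoted add yields the 64-bit leading-zero count; the last line below subtracts it from 64 to get the fls64 value, a step that is not part of the quoted sequence. srwi is the "shift right word immediate" mnemonic.

```c
#include <stdint.h>

/* Hypothetical helper modelling PowerPC cntlzw: counts leading zero
 * bits of a 32-bit word and returns 32 when the input is 0.
 * __builtin_clz() is undefined for 0, hence the explicit check. */
static inline uint32_t clz32_ppc(uint32_t x)
{
	return x ? (uint32_t)__builtin_clz(x) : 32u;
}

/* Hypothetical name; mirrors the quoted sequence step by step. */
static inline int fls64_branchless(uint64_t x)
{
	uint32_t h = (uint32_t)(x >> 32);
	uint32_t l = (uint32_t)x;

	uint32_t ch = clz32_ppc(h);	/* cntlzw ch,h */
	uint32_t cl = clz32_ppc(l);	/* cntlzw cl,l */
	uint32_t t1 = ch >> 5;		/* srwi t1,ch,5: 1 only if ch == 32, i.e. h == 0 */

	t1 = 0u - t1;			/* neg t1,t1: all ones if h == 0, else 0 */
	cl &= t1;			/* and cl,t1,cl: keep cl only if h == 0 */

	uint32_t lz = ch + cl;		/* add: leading zeroes of x, 0..64 */

	/* Extra step, not in the quoted sequence: turn the leading-zero
	 * count into the 1-based index of the highest set bit. */
	return 64 - (int)lz;
}
```

With this model, fls64_branchless(0) gives 0 and an argument with bit 63 set gives 64, matching the convention of the generic fls64 quoted earlier.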

I should not have said the magic word "optimal", I guess ;). The code you
show would fit nicely as an arch-specific optimized version of fls64 for
32-bit powerpc in include/asm-powerpc/bitops.h.

Greetings,
    Alexander
(who is not going to write and test a patch with powerpc inline assembly
soon. srwi?)

> Don't worry about the number of cntlzw, it's one clock on all 32 bit
> PPC processors I know, some may even be able to perform 2 or 3 cntlzw
> per clock.
>
> 	Regards,
> 	Gabriel

-- 
  Alexander van Heukelum
  [EMAIL PROTECTED]

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev