I suppose you're using (assuming 32-bit)
ctz(x) := 31 - clz(x & -x)
now, which gives -1 for 0; and the version you're looking for is
ctz(x) := 32 - clz(~x & (x-1))
which gives 32 for 0.
Thanks! That's, unfortunately, one more instruction, although I guess
a lot of chips have "a & ~b" as one operation.
Yes, it's exactly the same cost on PowerPC, and on most other
RISC architectures.
It looks like ~x & (x-1) turns any number into 000...111... where the
boundary between zeroes and ones lies at the lowest 1 in the original.
Exactly. "To the right of the lowest 1".
Is popcount really slow on PowerPC? (Compared to clz?) Ideally one
would choose between the two expansions based on RTL costs, but the
only architectures it matters for are i386 and powerpc, and neither
of them define the cost of either clz or popcount.
Andrew answered this already. Adding clz/popcount to the cost
tables seems like a good idea, yes.
Segher