> It is because mask 0xffffffff is optimized to 0xfffffffc by keeping track
> of non-zero bits in registers and the above code doesn't take that
> into account.

Then I'd suggest modifying that code so that it does take that into account, rather than essentially duplicating it. But I'd recommend running some performance tests to verify that you're not pessimizing things when you do that: this stuff can be very tricky, and you want to make sure that you're not unnecessarily converting something like (and X 3) into a bit extraction.
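
To make the concern concrete, here is a minimal standalone sketch in C. It is not the code under review: the helper names (narrow_mask, is_low_bit_mask, is_contiguous_mask, is_bit_extraction_candidate) are hypothetical, and the rule "a low-bit mask stays a plain AND" is an assumption for illustration only. It shows how a mask shrinks once known-zero bits are taken into account, and one way to keep a mask such as 3 from being treated as a bit extraction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: drop mask bits that are known to be zero in the
   operand.  nonzero_bits has a 1 for every bit that may be nonzero. */
static uint32_t narrow_mask(uint32_t mask, uint32_t nonzero_bits)
{
    return mask & nonzero_bits;
}

/* True if mask has the form 2^n - 1 (e.g. 3 or 0xff).  The sketch
   assumes such masks are best left as a plain AND. */
static bool is_low_bit_mask(uint32_t mask)
{
    return mask != 0 && (mask & (uint32_t)(mask + 1u)) == 0;
}

/* True if the set bits of mask form a single contiguous run. */
static bool is_contiguous_mask(uint32_t mask)
{
    if (mask == 0)
        return false;
    uint32_t lowest = mask & (~mask + 1u);    /* lowest set bit */
    uint32_t carried = mask + lowest;         /* clears the low run; may wrap to 0 */
    return (carried & mask) == 0;
}

/* Assumed policy for illustration: only a contiguous run that does not
   start at bit 0 is considered for a bit extraction. */
static bool is_bit_extraction_candidate(uint32_t mask)
{
    return is_contiguous_mask(mask) && !is_low_bit_mask(mask);
}
```

With these definitions, narrow_mask(0xffffffff, 0xfffffffc) gives 0xfffffffc, which the sketch treats as an extraction candidate, while a mask of 3 is left alone. Whether that policy is actually a win on a given target is exactly what the performance tests would need to show.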