Am 11.03.2013 18:07, schrieb Paolo Bonzini: > Il 11/03/2013 18:06, ronnie sahlberg ha scritto: >> Even more efficient might be to do bitwise instead of logical or >> >>>> if (tmp | d1 | d2 | d3) { >> that should remove 3 of the 4 conditional jumps >> and should become 3 bitwise ors and one conditional jump > > Without any serious profiling, please let the compiler do that.
Paolo is right, i ran some tests with gcc 4.6.3 on x86_64 (with -O3) and tried the various ideas. They all made no significant difference. Even unrolling to 8 unsigned longs didn't change anything. What I tried is running 1^20 interations of find_next_bit(bitfield,4194304,0); I choosed the bitfield to be 4MByte which equals a 16GB VM. The bitfield was complete zeroed so find_next_bit had to run completely through the bitfield. The original version took 1 minute and 10 seconds whereas all other took approx. 37-38 seconds which is almost a 100% boost ;-) So I think this here is the final version: while (size >= 4*BITS_PER_LONG) { unsigned long d1, d2, d3; tmp = *p; d1 = *(p+1); d2 = *(p+2); d3 = *(p+3); if (tmp) { goto found_middle; } if (d1 || d2 || d3) { break; } p += 4; result += 4*BITS_PER_LONG; size -= 4*BITS_PER_LONG; } Peter