On 22/10/2015 16:37, Eric Blake wrote:
>> > + /* Check first 16 bytes manually. */
>> > + for (len = 0; len < 16; len++)
>> > + {
>> > + if (! bufsize)
>> > + return true;
>> > + if (*p)
>> > + return false;
>> > + p++;
>> > + bufsize--;
>> > + }
>> > +
>> > + /* Now we know that's zero, memcmp with self. */
>> > + return memcmp (buf, p, bufsize) == 0;
>> > }
> Cool trick of using a suitably-aligned overlap-to-self check to then
> trigger platform-specific speedups without having to rewrite them by
> hand! qemu is doing a similar check in util/cutils.c:buffer_is_zero()
> that could probably benefit from the same idea.
Nice trick indeed. On the other hand, the first 16 bytes are enough to
rule out 99.99% (number out of thin air) of the non-zero blocks, so
that's where you want to optimize. Checking them an unsigned long at a
time, or fetching a few unsigned longs and ORing them together would
probably be the best of both worlds, because you then only use the FPU
(the SIMD registers behind an optimized memcmp) in the rare case of a
zero buffer.
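
Something along these lines is what I have in mind. It is only an
untested sketch: the name is_zero_sketch and the HEAD constant are made
up for illustration, and it is not the gnulib or qemu code. It assumes
sizeof (unsigned long) divides 16.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the first BUFSIZE bytes of BUF are 0.  */
bool
is_zero_sketch (const void *buf, size_t bufsize)
{
  const unsigned char *p = buf;
  enum { HEAD = 16 };

  if (bufsize < HEAD)
    {
      /* Short buffer: a plain byte loop is enough.  */
      for (size_t i = 0; i < bufsize; i++)
        if (p[i])
          return false;
      return true;
    }

  /* Fetch the first 16 bytes as unsigned longs and OR them together.
     Going through memcpy sidesteps alignment and aliasing problems;
     compilers normally turn it into plain loads.  */
  unsigned long words[HEAD / sizeof (unsigned long)];
  unsigned long acc = 0;
  memcpy (words, p, HEAD);
  for (size_t i = 0; i < sizeof words / sizeof words[0]; i++)
    acc |= words[i];
  if (acc)
    return false;

  /* The head is zero, so comparing the buffer with itself shifted by
     16 bytes shows whether the rest is zero too.  */
  return memcmp (p, p + HEAD, bufsize - HEAD) == 0;
}
```

Most non-zero blocks should return after the OR test without ever
calling memcmp; whether that is a measurable win would need
benchmarking.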
Paolo