On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier...@gmail.com> wrote:
> I think we can add a small optimization to this last patch [1].
I think if you want to make it faster, you could partially unroll the
inner-most loop, like:

// size_t * 4
for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
{
    if (((size_t *) p)[0] != 0 |
        ((size_t *) p)[1] != 0 |
        ((size_t *) p)[2] != 0 |
        ((size_t *) p)[3] != 0)
        return false;
}

$ gcc allzeros.c -O2 -o allzeros && ./allzeros
char: done in 1595000 nanoseconds
size_t: done in 198300 nanoseconds (8.04337 times faster than char)
size_t * 4: done in 81500 nanoseconds (19.5706 times faster than char)
size_t * 8: done in 71000 nanoseconds (22.4648 times faster than char)

The final one above works out to about 110GB/sec, so it's probably only
going that fast because the memory being checked is sitting in L1. DDR5
is only around 64GB/sec, so it's probably overkill to unroll the loop
that much.

Also, unrolling like that means the final byte-at-a-time loop can have
more to do, which might make cases with a long remainder slower. To make
up for that, there's some incentive to introduce yet another loop to
process single size_t's up to aligned_end, and then you end up with even
more code.

I was happy enough with my patch with Bertrand's comments addressed.

I'm not sure why unsigned chars would be better than plain chars. It
doesn't seem to have any effect on the compiled code.
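Anyway, to put the pieces above together in one place, here's a minimal
sketch of the overall shape I mean: a byte loop for small inputs and to
get aligned, the unrolled loop, the extra single-size_t loop, then the
byte tail. This is just an illustration typed out for this email, not
the patch itself; allzeros_sketch is a made-up name, and the size_t
loads alias the buffer the same way the snippet above does, so the same
strict-aliasing caveats apply:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static bool
allzeros_sketch(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;
    const unsigned char *aligned_end;

    /* small inputs: the setup below isn't worth it, just scan bytes */
    if (len < sizeof(size_t) * 4)
    {
        for (; p < end; p++)
        {
            if (*p != 0)
                return false;
        }
        return true;
    }

    /* end, rounded down to a size_t boundary */
    aligned_end = (const unsigned char *)
        ((uintptr_t) end & ~(uintptr_t) (sizeof(size_t) - 1));

    /* byte-at-a-time until p is suitably aligned for size_t loads */
    while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
    {
        if (*p++ != 0)
            return false;
    }

    /* 4-way unrolled main loop; bitwise | keeps it branch-free per word */
    for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
    {
        if (((const size_t *) p)[0] != 0 |
            ((const size_t *) p)[1] != 0 |
            ((const size_t *) p)[2] != 0 |
            ((const size_t *) p)[3] != 0)
            return false;
    }

    /* the "yet another loop": single size_t's up to aligned_end */
    for (; p < aligned_end; p += sizeof(size_t))
    {
        if (*(const size_t *) p != 0)
            return false;
    }

    /* trailing bytes the word loops couldn't cover */
    for (; p < end; p++)
    {
        if (*p != 0)
            return false;
    }
    return true;
}

int
main(void)
{
    static unsigned char buf[8193]; /* odd size to exercise the tail loop */

    printf("%d\n", (int) allzeros_sketch(buf, sizeof(buf))); /* prints 1 */
    buf[sizeof(buf) - 1] = 1;
    printf("%d\n", (int) allzeros_sketch(buf, sizeof(buf))); /* prints 0 */
    return 0;
}

David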