On Tue, Nov 12, 2024 at 10:56:20AM +0000, Bertrand Drouvot wrote:
> I think that depends on the memory area size. If the size is small enough
> then the byte per byte can be good enough.
>
> For example, with the allzeros_small.c attached:
>
> == with BLCKSZ 32
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 22528 nanoseconds
> size_t: done in 6949 nanoseconds (3.24191 times faster than byte per byte)
> SIMD v10: done in 7562 nanoseconds (2.97911 times faster than byte per byte)
> SIMD v11: done in 22096 nanoseconds (1.01955 times faster than byte per byte)
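
For readers without the attachment at hand, here is a rough sketch of what
the two scalar variants in this comparison might look like. The function
names follow the ones used in this thread, but the bodies are an assumption
on my side and not a copy of allzeros_small.c; the SIMD v10 and v11 variants
are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Byte-per-byte variant (assumed shape): inspect every byte in turn and
 * stop at the first non-zero one.
 */
static inline bool
allzeros_byte_per_byte(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;

	for (size_t i = 0; i < len; i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

/*
 * size_t variant (assumed shape): compare one machine word at a time,
 * then finish the remaining tail bytes one by one.  memcpy() is used for
 * the word load so that the sketch does not depend on the alignment of
 * the input pointer; compilers typically turn it into a plain load.
 */
static inline bool
allzeros_size_t(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;
	const unsigned char *end = p + len;

	/* Word-sized steps while at least sizeof(size_t) bytes remain. */
	while ((size_t) (end - p) >= sizeof(size_t))
	{
		size_t		word;

		memcpy(&word, p, sizeof(word));
		if (word != 0)
			return false;
		p += sizeof(size_t);
	}

	/* Tail: fewer than sizeof(size_t) bytes left. */
	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}
	return true;
}
```

With an 8-byte size_t, a 32-byte area takes four word comparisons instead of
32 byte comparisons, and a 63-byte area takes seven word comparisons plus
seven tail bytes.
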

Some numbers from here, for the same test case at 32 bytes, with an older
version of gcc:
$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 28193 nanoseconds
size_t: done in 4382 nanoseconds (6.43382 times faster than byte per byte)
SIMD v10: done in 8074 nanoseconds (3.49183 times faster than byte per byte)
SIMD v11: done in 26970 nanoseconds (1.04535 times faster than byte per byte)

> == with BLCKSZ 63
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 29246 nanoseconds
> size_t: done in 10555 nanoseconds (2.77082 times faster than byte per byte)
> SIMD v10: done in 11220 nanoseconds (2.6066 times faster than byte per byte)
> SIMD v11: done in 29126 nanoseconds (1.00412 times faster than byte per byte)
>
> Obviously v11 is about the same time as "byte per byte" but we can see that
> the size_t or v10 improvement is not that much for small size.

For 63 bytes:
byte per byte: done in 52611 nanoseconds
size_t: done in 21309 nanoseconds (2.46896 times faster than byte per byte)
SIMD v10: done in 16181 nanoseconds (3.25141 times faster than byte per byte)
SIMD v11: done in 51931 nanoseconds (1.01309 times faster than byte per byte)

> While for larger size:
>
> It's sensitive improvement.

Yep, for large sizes.

> Based on the above I've the feeling that doing byte per byte comparison for
> small size only (< 64b) is good enough. I'm not sure that adding extra
> complexity for small sizes is worth it.

Well, this is also telling us that we are at least 2 times faster if
we use allzeros_size_t() for areas smaller than 64 bytes rather than
allzeros_byte_per_byte() per your measurement, and I'm seeing even
faster numbers.  So that seems worth the addition, especially for
smaller sizes where this is 6 times faster here.
--
Michael