On Tue, Nov 12, 2024 at 10:56:20AM +0000, Bertrand Drouvot wrote:
> I think that depends on the memory area size. If the size is small enough
> then byte per byte can be good enough.
> 
> For example, with the allzeros_small.c attached:
> 
> == with BLCKSZ 32
> 
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 22528 nanoseconds
> size_t: done in 6949 nanoseconds (3.24191 times faster than byte per byte)
> SIMD v10: done in 7562 nanoseconds (2.97911 times faster than byte per byte)
> SIMD v11: done in 22096 nanoseconds (1.01955 times faster than byte per byte)

Some numbers from here, for the same test case at 32 bytes, with an
older version of gcc:
$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 28193 nanoseconds
size_t: done in 4382 nanoseconds (6.43382 times faster than byte per byte)
SIMD v10: done in 8074 nanoseconds (3.49183 times faster than byte per byte)
SIMD v11: done in 26970 nanoseconds (1.04535 times faster than byte per byte)

> == with BLCKSZ 63
> 
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 29246 nanoseconds
> size_t: done in 10555 nanoseconds (2.77082 times faster than byte per byte)
> SIMD v10: done in 11220 nanoseconds (2.6066 times faster than byte per byte)
> SIMD v11: done in 29126 nanoseconds (1.00412 times faster than byte per byte)
> 
> Obviously v11 takes about the same time as "byte per byte", but we can see
> that the size_t or v10 improvement is not that large for small sizes.

For 63 bytes:
byte per byte: done in 52611 nanoseconds
size_t: done in 21309 nanoseconds (2.46896 times faster than byte per byte)
SIMD v10: done in 16181 nanoseconds (3.25141 times faster than byte per byte)
SIMD v11: done in 51931 nanoseconds (1.01309 times faster than byte per byte)


> While for larger size:
> 
> It's a noticeable improvement.

Yep, for large sizes.

> Based on the above I've the feeling that doing byte per byte comparison for
> small size only (< 64b) is good enough. I'm not sure that adding extra
> complexity for small sizes is worth it.

Well, this is also telling us that we are at least 2 times faster if
we use allzeros_size_t() rather than allzeros_byte_per_byte() for areas
smaller than 64 bytes per your measurements, and I'm seeing even faster
numbers at 32 bytes.  So that seems worth the addition, especially for
the smallest sizes, where this is more than 6 times faster here.
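
For reference, here is a minimal sketch of the two scalar variants being
compared.  This is not the code from allzeros_small.c, which is not
reproduced in this message: only the function names come from the
discussion above, while the signatures and bodies are assumptions about
what such checks typically look like.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed shape: check the area one byte at a time. */
static bool
allzeros_byte_per_byte(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;

	for (size_t i = 0; i < len; i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

/*
 * Assumed shape: check sizeof(size_t) bytes per iteration, then finish
 * the remaining tail byte by byte.
 */
static bool
allzeros_size_t(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;
	const unsigned char *end = p + len;

	/* Word-sized chunks; memcpy() avoids any alignment assumption. */
	while ((size_t) (end - p) >= sizeof(size_t))
	{
		size_t		word;

		memcpy(&word, p, sizeof(word));
		if (word != 0)
			return false;
		p += sizeof(size_t);
	}

	/* Tail of fewer than sizeof(size_t) bytes. */
	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}
	return true;
}
```

With an 8-byte size_t, a 32-byte area takes 4 loop iterations instead
of 32, and a 63-byte area takes 7 word checks plus 7 tail bytes, which
would be consistent with the gain shrinking between the 32-byte and
63-byte runs.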
--
Michael
