On Fri, Nov 15, 2024 at 11:43, Bertrand Drouvot <bertranddrouvot...@gmail.com> wrote:
> Hi,
>
> On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
> > There is a tiny typo with V13.
> > + /* "len" in the [sizeof(size_t) * 8, inf] range */
> > I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity cannot be
> > included in an interval.
>
> Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at
> the end) might be the proper notation from a mathematical point of view.
>
Thanks for clarifying.

> > But, I'm not sure if I'm still doing something wrong.
> > If so, forgive me for the noise.
> >
> > Of course I expected "not is_allzeros".
>
> That's the test case which is "wrong" (not the function):
>
> "
> size_t pagebytes[BLCKSZ] = {0};
> volatile bool result;
>
> pagebytes[BLCKSZ-2] = 1;
>
> result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
> "
>
> The pagebytes is an array of size_t (8 bytes each), so the actual array
> size is 8192 * 8 = 65536 bytes.
>
> So, pagebytes[BLCKSZ-2] = 1, sets byte 65520 ((8192-2)*8) to 1.
>
> But the function is checking up to BLCKSZ bytes (8192), hence the results
> you observed (which are correct).
>
Thanks for pointing out my mistake.

> > Anyway, I made another attempt to optimize a bit more, I don't know if
> > it's safe though.
>
> There is an issue in your v14, it calls:
>
> "
> return pg_memory_is_all_zeros_simd(ptr, ptr + len);
> "
>
> but you defined it that way:
>
> "
> static inline bool
> pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)
> "
>
> while that should be:
>
> "
> static inline bool
> pg_memory_is_all_zeros_simd(const void *p, const void *end)
> "
>
What I'm trying here, obviously, is a hack.
If it works, and the compiler accepts it, it's ok for me.

> Doing so, I do not observe any improvements with v14.
>
So.
Again, new results from v4_allzeros_small.c (attached):

Linux Ubuntu 22.04
gcc 13, 64 bits

With BLCKSZ 32
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 44092 nanoseconds
size_t: done in 13456 nanoseconds (3.27675 times faster than byte per byte)
SIMD v10: done in 14249 nanoseconds (3.09439 times faster than byte per byte)
SIMD v11: done in 32516 nanoseconds (1.35601 times faster than byte per byte)
SIMD v12: done in 14973 nanoseconds (2.94477 times faster than byte per byte)
SIMD v14: done in 13101 nanoseconds (3.36554 times faster than byte per byte)

With BLCKSZ 63
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 67656 nanoseconds
size_t: done in 25768 nanoseconds (2.62558 times faster than byte per byte)
SIMD v10: done in 21446 nanoseconds (3.15471 times faster than byte per byte)
SIMD v11: done in 56887 nanoseconds (1.18931 times faster than byte per byte)
SIMD v12: done in 22863 nanoseconds (2.95919 times faster than byte per byte)
SIMD v14: done in 21273 nanoseconds (3.18037 times faster than byte per byte)

With BLCKSZ 256
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 220064 nanoseconds
size_t: done in 45886 nanoseconds (4.79589 times faster than byte per byte)
SIMD v10: done in 12032 nanoseconds (18.2899 times faster than byte per byte)
SIMD v11: done in 11965 nanoseconds (18.3923 times faster than byte per byte)
SIMD v12: done in 12041 nanoseconds (18.2762 times faster than byte per byte)
SIMD v14: done in 12575 nanoseconds (17.5001 times faster than byte per byte)

With BLCKSZ 8192
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 10365876 nanoseconds
size_t: done in 827654 nanoseconds (12.5244 times faster than byte per byte)
SIMD v10: done in 347755 nanoseconds (29.808 times faster than byte per byte)
SIMD v11: done in 342813 nanoseconds (30.2377 times faster than byte per byte)
SIMD v12: done in 341124 nanoseconds (30.3874 times faster than byte per byte)
SIMD v14: done in 50646 nanoseconds (204.673 times faster than byte per byte)

Results with v4_allzeros_check.c (attached):

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=32
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=63
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=256
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffedb8ac430) end=(0x7ffedb8ac530) p02=(0x7ffedb8ac530)
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=8192
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffd8864c200) end=(0x7ffd8864e200) p02=(0x7ffd8864e200)
SIMD v14: is_allzeros

If this hack is safe and correct, I think that being 204 times faster
is very good for a block size of 8192.

That said, V13 is fine as is.
LGTM.

best regards,
Ranier Vilela
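
P.S. A minimal sketch of the test-case mistake discussed above, for anyone
who wants to reproduce it. It is an illustration only and is not taken from
the patch or from the attached files: all function names below are
hypothetical, and the byte-per-byte loop stands in for the real
pg_memory_is_all_zeros variants. It assumes BLCKSZ is 8192, as in the thread.
The last helper also shows why "ptr + len" means different things for a
"size_t *" and for a byte pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192				/* assumed, same value as in the thread */

/* Hypothetical reference check: true if the first "len" bytes are all zero. */
bool
is_all_zeros_bytewise(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;

	assert(ptr != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

/*
 * Same shape as the earlier test: an array of size_t elements, but only the
 * first BLCKSZ bytes are checked. The non-zero element starts at byte
 * (BLCKSZ - 2) * sizeof(size_t), which lies beyond the checked range.
 */
bool
check_size_t_array_prefix(void)
{
	static size_t pagebytes[BLCKSZ];	/* static storage is zero-initialized */

	pagebytes[BLCKSZ - 2] = 1;
	return is_all_zeros_bytewise(pagebytes, BLCKSZ);
}

/* Same array, but the whole object (sizeof) is checked this time. */
bool
check_size_t_array_full(void)
{
	static size_t pagebytes[BLCKSZ];

	pagebytes[BLCKSZ - 2] = 1;
	return is_all_zeros_bytewise(pagebytes, sizeof(pagebytes));
}

/* With a byte array, index BLCKSZ - 2 is inside the BLCKSZ bytes checked. */
bool
check_byte_array(void)
{
	unsigned char pagebytes[BLCKSZ] = {0};

	pagebytes[BLCKSZ - 2] = 1;
	return is_all_zeros_bytewise(pagebytes, BLCKSZ);
}

/* Adding n to a "size_t *" advances n * sizeof(size_t) bytes, not n bytes. */
size_t
bytes_advanced_by_size_t_ptr(size_t n)
{
	static size_t buf[64];

	assert(n <= 64);			/* stay within (or one past) the array */
	return (size_t) ((const unsigned char *) (buf + n) -
					 (const unsigned char *) buf);
}
```

With a 4-byte or 8-byte size_t, check_size_t_array_prefix() reports all zeros
while check_size_t_array_full() and check_byte_array() do not, which matches
the explanation quoted above.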