Em sáb., 16 de nov. de 2024 às 11:40, Ranier Vilela <ranier...@gmail.com> escreveu:
> > Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot < > bertranddrouvot...@gmail.com> escreveu: > >> Hi, >> >> On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote: >> > There is a tiny typo with V13. >> > + /* "len" in the [sizeof(size_t) * 8, inf] range */ >> >> I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be >> included >> into a interval. >> >> Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at >> the end) >> might be the proper notation from a mathematical point of view. >> > Thanks for clarifying. > > >> >> > But, I'm not sure if I'm still doing something wrong. >> > If so, forgive me for the noise. >> > >> > Of course I expected "not is_allzeros". >> >> That's the test case which is "wrong" (not the function): >> >> " >> size_t pagebytes[BLCKSZ] = {0}; >> volatile bool result; >> >> pagebytes[BLCKSZ-2] = 1; >> >> result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ); >> " >> >> The pagebytes is an array of size_t (8 bytes each), so the actual array >> size >> is 8192 * 8 = 65536 bytes. >> >> So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1. >> >> But the function is checking up to BLCKSZ bytes (8192), so the results you >> observed (which are correct). >> > Thanks for pointing out my mistake. > > >> >> > Anyway, I made another attempt to optimize a bit more, I don't know if >> it's >> > safe though. >> >> There is an issue in your v14, it calls: >> >> " >> return pg_memory_is_all_zeros_simd(ptr, ptr + len); >> " >> >> but you defined it that way: >> >> " >> static inline bool >> pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end) >> >> " >> >> while that should be: >> >> " >> static inline bool >> pg_memory_is_all_zeros_simd(const void *p, const void *end) >> > What I'm trying here, obviously, is a hack. > If it works, and the compiler accepts it, it's ok for me. > > >> " >> >> Doing so, I do not observe any improvments with v14. >> > So. > Again new results from v4_allzeros_small.c attached: > Linux Ubuntu 22.04 > gcc 13 64 bits > > With BLCKSZ 32 > gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; > ./v4_allzeros_small > byte per byte: done in 44092 nanoseconds > size_t: done in 13456 nanoseconds (3.27675 times faster than byte per byte) > SIMD v10: done in 14249 nanoseconds (3.09439 times faster than byte per > byte) > SIMD v11: done in 32516 nanoseconds (1.35601 times faster than byte per > byte) > SIMD v12: done in 14973 nanoseconds (2.94477 times faster than byte per > byte) > SIMD v14: done in 13101 nanoseconds (3.36554 times faster than byte per > byte) > > With BLCKSZ 63 > gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; > ./v4_allzeros_small > byte per byte: done in 67656 nanoseconds > size_t: done in 25768 nanoseconds (2.62558 times faster than byte per byte) > SIMD v10: done in 21446 nanoseconds (3.15471 times faster than byte per > byte) > SIMD v11: done in 56887 nanoseconds (1.18931 times faster than byte per > byte) > SIMD v12: done in 22863 nanoseconds (2.95919 times faster than byte per > byte) > SIMD v14: done in 21273 nanoseconds (3.18037 times faster than byte per > byte) > > With BLCKSZ 256 > gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; > ./v4_allzeros_small > byte per byte: done in 220064 nanoseconds > size_t: done in 45886 nanoseconds (4.79589 times faster than byte per byte) > SIMD v10: done in 12032 nanoseconds (18.2899 times faster than byte per > byte) > SIMD v11: done in 11965 nanoseconds (18.3923 times faster than byte per > byte) > SIMD v12: done in 12041 nanoseconds (18.2762 times faster than byte per > byte) > SIMD v14: done in 12575 nanoseconds (17.5001 times faster than byte per > byte) > > With BLCKSZ 8192 > gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; > ./v4_allzeros_small > byte per byte: done in 10365876 nanoseconds > size_t: done in 827654 nanoseconds (12.5244 times faster than byte per > byte) > SIMD v10: done in 347755 nanoseconds (29.808 times faster than byte per > byte) > SIMD v11: done in 342813 nanoseconds (30.2377 times faster than byte per > byte) > SIMD v12: done in 341124 nanoseconds (30.3874 times faster than byte per > byte) > SIMD v14: done in 50646 nanoseconds (204.673 times faster than byte per > byte) > > Results with v4_allzeros_check.c attached: > gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; > ./v4_allzeros_check > sizeof(pagebytes)=32 > byte per byte: is_allzeros > size_t: is_allzeros > SIMD v10: is_allzeros > SIMD v11: is_allzeros > SIMD v12: is_allzeros > SIMD v14: is_allzeros > > gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; > ./v4_allzeros_check > sizeof(pagebytes)=63 > byte per byte: is_allzeros > size_t: is_allzeros > SIMD v10: is_allzeros > SIMD v11: is_allzeros > SIMD v12: is_allzeros > SIMD v14: is_allzeros > > gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; > ./v4_allzeros_check > sizeof(pagebytes)=256 > byte per byte: is_allzeros > size_t: is_allzeros > SIMD v10: is_allzeros > SIMD v11: is_allzeros > SIMD v12: is_allzeros > p01=(0x7ffedb8ac430) > end=(0x7ffedb8ac530) > p02=(0x7ffedb8ac530) > SIMD v14: is_allzeros > > gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; > ./v4_allzeros_check > sizeof(pagebytes)=8192 > byte per byte: is_allzeros > size_t: is_allzeros > SIMD v10: is_allzeros > SIMD v11: is_allzeros > SIMD v12: is_allzeros > p01=(0x7ffd8864c200) > end=(0x7ffd8864e200) > p02=(0x7ffd8864e200) > SIMD v14: is_allzeros > > If this hack is safe and correct, I think that 204 times faster, > it is very good, for a block size 8192. > > That said, > V13 is fine as is. > LGTM. > Now with files attached. best regards, Ranier Vilela
#include <stdbool.h> #include <stddef.h> #include <string.h> #include <stdio.h> #include <stdint.h> #include <time.h> #include <immintrin.h> #define BLCKSZ 8192 #define LOOPS 1000 static inline bool allzeros_byte_per_byte(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool allzeros_size_t(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool pg_memory_is_all_zeros_v10(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool pg_memory_is_all_zeros_v11(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); /* * For len < 64, compare byte per byte to ensure we'll not read beyond the * memory area. */ if (len < sizeof(size_t) * 8) { while (p < end) { if (*p++ != 0) return false; } return true; } /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool pg_memory_is_all_zeros_v12(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); if (len < sizeof(size_t)) // < 8 bytes { while (p < end) { if (*p++ != 0) return false; } return true; } else if (len < sizeof(size_t) * 8) // 8-63 bytes { while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } while (p < end) { if (*p++ != 0) return false; } return true; } /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool all_zeros_simd(const size_t *p, const size_t * end) { for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } for (; p < end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } return true; } static inline bool pg_memory_is_all_zeros_v14(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); if ((len >= sizeof(size_t) * 8) && (((uintptr_t) p & (sizeof(size_t) - 1)) == 0)) { return all_zeros_simd(ptr, ptr + len); } if (len < sizeof(size_t)) { while (p < end) { if (*p++ != 0) return false; } return true; } else if (len < sizeof(size_t) * 8) { while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } while (p < end) { if (*p++ != 0) return false; } return true; } /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } #define NANOSEC_PER_SEC 1000000000 // Returns difference in nanoseconds int64_t get_clock_diff(struct timespec *t1, struct timespec *t2) { int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC; nanosec += (t1->tv_nsec - t2->tv_nsec); return nanosec; } int main() { size_t pagebytes[BLCKSZ] = {0}; volatile bool result; struct timespec start,end; int64_t byte_time, size_t_time; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for (int i = 0; i < LOOPS; i++) { result = allzeros_byte_per_byte(pagebytes, BLCKSZ); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); byte_time = get_clock_diff(&end, &start); printf("byte per byte: done in %ld nanoseconds\n", byte_time); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for (int i = 0; i < LOOPS; i++) { result = allzeros_size_t(pagebytes, BLCKSZ); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); size_t_time = get_clock_diff(&end, &start); printf("size_t: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for (int i = 0; i < LOOPS; i++) { result = pg_memory_is_all_zeros_v10(pagebytes, BLCKSZ); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); size_t_time = get_clock_diff(&end, &start); printf("SIMD v10: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for (int i = 0; i < LOOPS; i++) { result = pg_memory_is_all_zeros_v11(pagebytes, BLCKSZ); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); size_t_time = get_clock_diff(&end, &start); printf("SIMD v11: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for (int i = 0; i < LOOPS; i++) { result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); size_t_time = get_clock_diff(&end, &start); printf("SIMD v12: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for (int i = 0; i < LOOPS; i++) { result = pg_memory_is_all_zeros_v14(pagebytes, BLCKSZ); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); size_t_time = get_clock_diff(&end, &start); printf("SIMD v14: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time); return 0; }
#include <stdbool.h> #include <stddef.h> #include <string.h> #include <stdio.h> #include <stdint.h> #include <time.h> #include <immintrin.h> #define BLCKSZ 8192 #define LOOPS 1000 static inline bool allzeros_byte_per_byte(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool allzeros_size_t(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool pg_memory_is_all_zeros_v10(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool pg_memory_is_all_zeros_v11(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); /* * For len < 64, compare byte per byte to ensure we'll not read beyond the * memory area. */ if (len < sizeof(size_t) * 8) { while (p < end) { if (*p++ != 0) return false; } return true; } /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool pg_memory_is_all_zeros_v12(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); if (len < sizeof(size_t)) // < 8 bytes { while (p < end) { if (*p++ != 0) return false; } return true; } else if (len < sizeof(size_t) * 8) // 8-63 bytes { while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } while (p < end) { if (*p++ != 0) return false; } return true; } /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } static inline bool all_zeros_simd(const size_t *p, const size_t * end) { printf("p01=(%p)\n", p); printf("end=(%p)\n", end); for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } for (; p < end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } printf("p02=(%p)\n", p); return true; } static inline bool pg_memory_is_all_zeros_v14(const void *ptr, size_t len) { const unsigned char *p = (const unsigned char *) ptr; const unsigned char *end = &p[len]; const unsigned char *aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1))); if ((len >= sizeof(size_t) * 8) && (((uintptr_t) p & (sizeof(size_t) - 1)) == 0)) { return all_zeros_simd(ptr, ptr + len); } if (len < sizeof(size_t)) { while (p < end) { if (*p++ != 0) return false; } return true; } else if (len < sizeof(size_t) * 8) { while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } while (p < end) { if (*p++ != 0) return false; } return true; } /* Compare bytes until the pointer "p" is aligned */ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0) { if (p == end) return true; if (*p++ != 0) return false; } /* * Compare 8 * sizeof(size_t) chunks at once. * * For performance reasons, we manually unroll this loop and purposefully * use bitwise-ORs to combine each comparison. This prevents boolean * short-circuiting and lets the compiler know that it's safe to access * all 8 elements regardless of the result of the other comparisons. This * seems to be enough to coax a few compilers into using SIMD * instructions. * * There is no risk to read beyond the memory area thanks to the len < 64 * check done below. */ for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8) { if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) | (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) | (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) | (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0)) return false; } /* * Compare remaining size_t-aligned chunks. * * aligned_end cant' be > end as we ensured to take care of len < 8 (in * the len < 64 check below). So, no risk to read beyond the memory area. */ for (; p < aligned_end; p += sizeof(size_t)) { if (*(size_t *) p != 0) return false; } /* Compare remaining bytes until the end */ while (p < end) { if (*p++ != 0) return false; } return true; } #define NANOSEC_PER_SEC 1000000000 // Returns difference in nanoseconds int64_t get_clock_diff(struct timespec *t1, struct timespec *t2) { int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC; nanosec += (t1->tv_nsec - t2->tv_nsec); return nanosec; } int main() { //size_t pagebytes[BLCKSZ] = {0}; char pagebytes[BLCKSZ] = {0}; volatile bool result; printf("sizeof(pagebytes)=%ld\n", sizeof(pagebytes)); result = allzeros_byte_per_byte(pagebytes, BLCKSZ); printf("byte per byte: %s\n", (result?"is_allzeros":"not is allzeros")); result = allzeros_size_t(pagebytes, BLCKSZ); printf("size_t: %s\n", (result?"is_allzeros":"not is allzeros")); result = pg_memory_is_all_zeros_v10(pagebytes, BLCKSZ); printf("SIMD v10: %s\n", (result?"is_allzeros":"not is allzeros")); result = pg_memory_is_all_zeros_v11(pagebytes, BLCKSZ); printf("SIMD v11: %s\n", (result?"is_allzeros":"not is allzeros")); result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ); printf("SIMD v12: %s\n", (result?"is_allzeros":"not is allzeros")); result = pg_memory_is_all_zeros_v14(pagebytes, BLCKSZ); printf("SIMD v14: %s\n", (result?"is_allzeros":"not is allzeros")); return 0; }