Em sáb., 16 de nov. de 2024 às 11:40, Ranier Vilela <[email protected]>
escreveu:
>
> Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <
> [email protected]> escreveu:
>
>> Hi,
>>
>> On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
>> > There is a tiny typo with V13.
>> > + /* "len" in the [sizeof(size_t) * 8, inf] range */
>>
>> I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be
>> included
>> into a interval.
>>
>> Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at
>> the end)
>> might be the proper notation from a mathematical point of view.
>>
> Thanks for clarifying.
>
>
>>
>> > But, I'm not sure if I'm still doing something wrong.
>> > If so, forgive me for the noise.
>> >
>> > Of course I expected "not is_allzeros".
>>
>> That's the test case which is "wrong" (not the function):
>>
>> "
>> size_t pagebytes[BLCKSZ] = {0};
>> volatile bool result;
>>
>> pagebytes[BLCKSZ-2] = 1;
>>
>> result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
>> "
>>
>> The pagebytes is an array of size_t (8 bytes each), so the actual array
>> size
>> is 8192 * 8 = 65536 bytes.
>>
>> So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1.
>>
>> But the function is checking up to BLCKSZ bytes (8192), so the results you
>> observed (which are correct).
>>
> Thanks for pointing out my mistake.
>
>
>>
>> > Anyway, I made another attempt to optimize a bit more, I don't know if
>> it's
>> > safe though.
>>
>> There is an issue in your v14, it calls:
>>
>> "
>> return pg_memory_is_all_zeros_simd(ptr, ptr + len);
>> "
>>
>> but you defined it that way:
>>
>> "
>> static inline bool
>> pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)
>>
>> "
>>
>> while that should be:
>>
>> "
>> static inline bool
>> pg_memory_is_all_zeros_simd(const void *p, const void *end)
>>
> What I'm trying here, obviously, is a hack.
> If it works, and the compiler accepts it, it's ok for me.
>
>
>> "
>>
>> Doing so, I do not observe any improvments with v14.
>>
> So.
> Again new results from v4_allzeros_small.c attached:
> Linux Ubuntu 22.04
> gcc 13 64 bits
>
> With BLCKSZ 32
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 44092 nanoseconds
> size_t: done in 13456 nanoseconds (3.27675 times faster than byte per byte)
> SIMD v10: done in 14249 nanoseconds (3.09439 times faster than byte per
> byte)
> SIMD v11: done in 32516 nanoseconds (1.35601 times faster than byte per
> byte)
> SIMD v12: done in 14973 nanoseconds (2.94477 times faster than byte per
> byte)
> SIMD v14: done in 13101 nanoseconds (3.36554 times faster than byte per
> byte)
>
> With BLCKSZ 63
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 67656 nanoseconds
> size_t: done in 25768 nanoseconds (2.62558 times faster than byte per byte)
> SIMD v10: done in 21446 nanoseconds (3.15471 times faster than byte per
> byte)
> SIMD v11: done in 56887 nanoseconds (1.18931 times faster than byte per
> byte)
> SIMD v12: done in 22863 nanoseconds (2.95919 times faster than byte per
> byte)
> SIMD v14: done in 21273 nanoseconds (3.18037 times faster than byte per
> byte)
>
> With BLCKSZ 256
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 220064 nanoseconds
> size_t: done in 45886 nanoseconds (4.79589 times faster than byte per byte)
> SIMD v10: done in 12032 nanoseconds (18.2899 times faster than byte per
> byte)
> SIMD v11: done in 11965 nanoseconds (18.3923 times faster than byte per
> byte)
> SIMD v12: done in 12041 nanoseconds (18.2762 times faster than byte per
> byte)
> SIMD v14: done in 12575 nanoseconds (17.5001 times faster than byte per
> byte)
>
> With BLCKSZ 8192
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 10365876 nanoseconds
> size_t: done in 827654 nanoseconds (12.5244 times faster than byte per
> byte)
> SIMD v10: done in 347755 nanoseconds (29.808 times faster than byte per
> byte)
> SIMD v11: done in 342813 nanoseconds (30.2377 times faster than byte per
> byte)
> SIMD v12: done in 341124 nanoseconds (30.3874 times faster than byte per
> byte)
> SIMD v14: done in 50646 nanoseconds (204.673 times faster than byte per
> byte)
>
> Results with v4_allzeros_check.c attached:
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=32
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> SIMD v14: is_allzeros
>
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=63
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> SIMD v14: is_allzeros
>
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=256
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> p01=(0x7ffedb8ac430)
> end=(0x7ffedb8ac530)
> p02=(0x7ffedb8ac530)
> SIMD v14: is_allzeros
>
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=8192
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> p01=(0x7ffd8864c200)
> end=(0x7ffd8864e200)
> p02=(0x7ffd8864e200)
> SIMD v14: is_allzeros
>
> If this hack is safe and correct, I think that 204 times faster,
> it is very good, for a block size 8192.
>
> That said,
> V13 is fine as is.
> LGTM.
>
Now with files attached.
best regards,
Ranier Vilela
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <immintrin.h>
#define BLCKSZ 8192
#define LOOPS 1000
static inline bool
allzeros_byte_per_byte(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
allzeros_size_t(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
pg_memory_is_all_zeros_v10(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
pg_memory_is_all_zeros_v11(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
/*
* For len < 64, compare byte per byte to ensure we'll not read beyond the
* memory area.
*/
if (len < sizeof(size_t) * 8)
{
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
pg_memory_is_all_zeros_v12(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
if (len < sizeof(size_t)) // < 8 bytes
{
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
else if (len < sizeof(size_t) * 8) // 8-63 bytes
{
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
for (; p < end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
return true;
}
static inline bool
pg_memory_is_all_zeros_v14(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
if ((len >= sizeof(size_t) * 8) && (((uintptr_t) p & (sizeof(size_t) - 1)) == 0))
{
return all_zeros_simd(ptr, ptr + len);
}
if (len < sizeof(size_t))
{
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
else if (len < sizeof(size_t) * 8)
{
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
#define NANOSEC_PER_SEC 1000000000
// Returns difference in nanoseconds
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
nanosec += (t1->tv_nsec - t2->tv_nsec);
return nanosec;
}
int main()
{
size_t pagebytes[BLCKSZ] = {0};
volatile bool result;
struct timespec start,end;
int64_t byte_time, size_t_time;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
result = allzeros_byte_per_byte(pagebytes, BLCKSZ);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
byte_time = get_clock_diff(&end, &start);
printf("byte per byte: done in %ld nanoseconds\n", byte_time);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
result = allzeros_size_t(pagebytes, BLCKSZ);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
size_t_time = get_clock_diff(&end, &start);
printf("size_t: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
result = pg_memory_is_all_zeros_v10(pagebytes, BLCKSZ);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
size_t_time = get_clock_diff(&end, &start);
printf("SIMD v10: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
result = pg_memory_is_all_zeros_v11(pagebytes, BLCKSZ);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
size_t_time = get_clock_diff(&end, &start);
printf("SIMD v11: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
size_t_time = get_clock_diff(&end, &start);
printf("SIMD v12: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (int i = 0; i < LOOPS; i++)
{
result = pg_memory_is_all_zeros_v14(pagebytes, BLCKSZ);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
size_t_time = get_clock_diff(&end, &start);
printf("SIMD v14: done in %ld nanoseconds (%g times faster than byte per byte)\n", size_t_time, (double) byte_time / size_t_time);
return 0;
}
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <immintrin.h>
#define BLCKSZ 8192
#define LOOPS 1000
static inline bool
allzeros_byte_per_byte(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
allzeros_size_t(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
pg_memory_is_all_zeros_v10(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
pg_memory_is_all_zeros_v11(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
/*
* For len < 64, compare byte per byte to ensure we'll not read beyond the
* memory area.
*/
if (len < sizeof(size_t) * 8)
{
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
pg_memory_is_all_zeros_v12(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
if (len < sizeof(size_t)) // < 8 bytes
{
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
else if (len < sizeof(size_t) * 8) // 8-63 bytes
{
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
printf("p01=(%p)\n", p);
printf("end=(%p)\n", end);
for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
for (; p < end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
printf("p02=(%p)\n", p);
return true;
}
static inline bool
pg_memory_is_all_zeros_v14(const void *ptr, size_t len)
{
const unsigned char *p = (const unsigned char *) ptr;
const unsigned char *end = &p[len];
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
if ((len >= sizeof(size_t) * 8) && (((uintptr_t) p & (sizeof(size_t) - 1)) == 0))
{
return all_zeros_simd(ptr, ptr + len);
}
if (len < sizeof(size_t))
{
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
else if (len < sizeof(size_t) * 8)
{
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
/* Compare bytes until the pointer "p" is aligned */
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
/*
* Compare 8 * sizeof(size_t) chunks at once.
*
* For performance reasons, we manually unroll this loop and purposefully
* use bitwise-ORs to combine each comparison. This prevents boolean
* short-circuiting and lets the compiler know that it's safe to access
* all 8 elements regardless of the result of the other comparisons. This
* seems to be enough to coax a few compilers into using SIMD
* instructions.
*
* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.
*/
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
/*
* Compare remaining size_t-aligned chunks.
*
* aligned_end cant' be > end as we ensured to take care of len < 8 (in
* the len < 64 check below). So, no risk to read beyond the memory area.
*/
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *) p != 0)
return false;
}
/* Compare remaining bytes until the end */
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}
#define NANOSEC_PER_SEC 1000000000
// Returns difference in nanoseconds
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
nanosec += (t1->tv_nsec - t2->tv_nsec);
return nanosec;
}
int main()
{
//size_t pagebytes[BLCKSZ] = {0};
char pagebytes[BLCKSZ] = {0};
volatile bool result;
printf("sizeof(pagebytes)=%ld\n", sizeof(pagebytes));
result = allzeros_byte_per_byte(pagebytes, BLCKSZ);
printf("byte per byte: %s\n", (result?"is_allzeros":"not is allzeros"));
result = allzeros_size_t(pagebytes, BLCKSZ);
printf("size_t: %s\n", (result?"is_allzeros":"not is allzeros"));
result = pg_memory_is_all_zeros_v10(pagebytes, BLCKSZ);
printf("SIMD v10: %s\n", (result?"is_allzeros":"not is allzeros"));
result = pg_memory_is_all_zeros_v11(pagebytes, BLCKSZ);
printf("SIMD v11: %s\n", (result?"is_allzeros":"not is allzeros"));
result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
printf("SIMD v12: %s\n", (result?"is_allzeros":"not is allzeros"));
result = pg_memory_is_all_zeros_v14(pagebytes, BLCKSZ);
printf("SIMD v14: %s\n", (result?"is_allzeros":"not is allzeros"));
return 0;
}