On Fri, 5 Apr 2024 at 07:15, Nathan Bossart <[email protected]> wrote:
> Here is an updated patch set. IMHO this is in decent shape and is
> approaching committable.
I checked the code generation on various gcc and clang versions. It
looks mostly fine starting from the first versions that support AVX-512,
gcc-7.1 and clang-5.
The main issue I saw was that clang was able to peel off the first
iteration of the loop, which lets it eliminate the mask assignment and
replace the masked load with a memory operand for vpopcntq. I was not
able to convince gcc to do the same, regardless of optimization options.
Generated code for the inner loop:
clang:
<L2>:
50: add rdx, 64
54: cmp rdx, rdi
57: jae <L1>
59: vpopcntq zmm1, zmmword ptr [rdx]
5f: vpaddq zmm0, zmm1, zmm0
65: jmp <L2>
gcc:
<L1>:
38: kmovq k1, rdx
3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
43: add rax, 64
47: mov rdx, -1
4e: vpopcntq zmm0, zmm0
54: vpaddq zmm0, zmm0, zmm1
5a: vmovdqa64 zmm1, zmm0
60: cmp rax, rsi
63: jb <L1>
I'm not sure how much that matters in practice. Attached is a patch that
peels the first iteration manually, giving essentially the same result
in gcc. As most distro packages are built using gcc, I think it would
make sense to have the extra code if it gives a noticeable benefit for
large cases.
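To make the shape of the manual peeling easier to see without AVX-512 hardware, here is a rough scalar sketch of the same control flow. It is not the patch itself: the 8-byte CHUNK stands in for sizeof(__m512i), the byte-wise load_masked() and load_full() helpers stand in for the masked and plain vector loads, and all function names are made up for illustration. It also assumes the caller passes a buffer padded to whole chunks, so that reading the full first and last chunk stays in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 8					/* stand-in for sizeof(__m512i) */

/* Hypothetical scalar stand-in for a masked load: keep byte i iff bit i is set. */
static uint64_t
load_masked(const unsigned char *p, unsigned mask)
{
	uint64_t	v = 0;

	for (int i = 0; i < CHUNK; i++)
		if (mask & (1u << i))
			v |= (uint64_t) p[i] << (8 * i);
	return v;
}

/* Hypothetical scalar stand-in for an unmasked load of a whole chunk. */
static uint64_t
load_full(const unsigned char *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < CHUNK; i++)
		v |= (uint64_t) p[i] << (8 * i);
	return v;
}

/* Portable bit count, clearing the lowest set bit each round. */
static int
popcount64(uint64_t v)
{
	int			n = 0;

	while (v)
	{
		v &= v - 1;
		n++;
	}
	return n;
}

/* Count set bits in base[start .. start + bytes - 1], first iteration peeled. */
uint64_t
popcount_peeled(const unsigned char *base, size_t start, size_t bytes)
{
	uint64_t	accum = 0;
	unsigned	mask;
	unsigned	tail_idx;
	const unsigned char *buf;
	const unsigned char *final;

	assert(bytes > 0);

	/* Start mask drops bytes before 'start'; tail_idx is 1..CHUNK. */
	mask = (0xFFu << (start % CHUNK)) & 0xFFu;
	tail_idx = (unsigned) ((start + bytes - 1) % CHUNK) + 1;
	final = base + (start + bytes - 1) / CHUNK * CHUNK;
	buf = base + start / CHUNK * CHUNK;

	if (buf < final)
	{
		/* Peeled first iteration: the only one that needs the start mask. */
		accum += popcount64(load_masked(buf, mask));
		buf += CHUNK;
		mask = 0xFFu;

		/* Remaining full chunks need no mask at all. */
		for (; buf < final; buf += CHUNK)
			accum += popcount64(load_full(buf));
	}

	/* Final chunk also ignores bytes past the end of the range. */
	mask &= 0xFFu >> (CHUNK - tail_idx);
	accum += popcount64(load_masked(buf, mask));
	return accum;
}

/* Byte-at-a-time reference, used only to cross-check the sketch. */
uint64_t
popcount_naive(const unsigned char *base, size_t start, size_t bytes)
{
	uint64_t	n = 0;

	for (size_t i = start; i < start + bytes; i++)
		n += popcount64(base[i]);
	return n;
}
```

The point of the structure is that the inner for loop no longer touches the mask, which is what allows the compiler to use a plain memory operand there. When the range fits in a single chunk, the if block is skipped and the start mask is simply combined with the tail mask.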
The visibility map patch has the same issue, but otherwise looks good.
Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dacc7553d29..f6e718b86e9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes)
 	 * Iterate through all but the final iteration. Starting from second
 	 * iteration, the start index mask is ignored.
 	 */
-	for (; buf < final; buf += sizeof(__m512i))
+	if (buf < final)
 	{
 		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
 		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
+		buf += sizeof(__m512i);
 		mask = ~UINT64CONST(0);
+
+		for (; buf < final; buf += sizeof(__m512i))
+		{
+			val = _mm512_load_si512((const __m512i *) buf);
+			cnt = _mm512_popcnt_epi64(val);
+			accum = _mm512_add_epi64(accum, cnt);
+		}
 	}
 	/* Final iteration needs to ignore bytes that are not within the length */