https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88276
Bug ID: 88276
Summary: AVX512: reorder bit ops to get free and operation
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: [email protected]
Target Milestone: ---
[code]
#include <immintrin.h>
#include <stdint.h>
int test1(const __m128i* src, int mask)
{
__m128i v = _mm_load_si128(src);
int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
return (cmp << 1) & mask;
}
int test2(const __m128i* src, int mask)
{
__m128i v = _mm_load_si128(src);
int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
return (cmp & (mask >> 1)) << 1;
}
[/code]
test1() shifts the result of _mm_cmpeq_epi16_mask() first, then ANDs it with
mask. In test2() mask is shifted first, then ANDed with the compare result, and
the result is shifted again. Since _mm_cmpeq_epi16_mask() on a 128-bit vector
produces only 8 bits (one per 16-bit lane), both versions are equivalent.
This compiles to the following asm, using gcc 8.2 with -O3
-march=skylake-avx512:
[asm]
test1(long long __vector(2) const*, int):
vpxor xmm0, xmm0, xmm0
vpcmpeqw k1, xmm0, XMMWORD PTR [rdi]
kmovb edx, k1
lea eax, [rdx+rdx]
and eax, esi
ret
test2(long long __vector(2) const*, int):
mov eax, esi
sar eax
vpxor xmm0, xmm0, xmm0
kmovb k2, eax
vpcmpeqw k1{k2}, xmm0, XMMWORD PTR [rdi]
kmovb eax, k1
add eax, eax
ret
[/asm]
Such a reordering can lead to more efficient code: with AVX-512 the AND can be
merged into the vpcmpeqw instruction as a write mask (k2 in test2 above). In my
case this was part of a bigger function performing a series of such
calculations over an array, and after this change it ran faster.