https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122074
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org
--- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to rockeet from comment #6)
> It is interesting that GCC fuses the load into the cmp if the code is
> changed a little:
>
> size_t avx512_search_byte_max32_2(const byte_t* data, size_t len, byte_t key) {
>     __mmask32 k = _bzhi_u32(-1, len);
>     return _tzcnt_u32(_mm256_mask_cmpeq_epi8_mask(k,
>         *(__m256i_u*)data, _mm256_set1_epi8(key)));
> }
>
> See https://godbolt.org/z/W8MKTbKPv ; it still generates an extra
> `mov eax, eax`.
        vpcmpeqb k0{k1}, ymm0, YMMWORD PTR [rdi]   # 99  [c=25 l=6] *avx512vl_eqv32qi3_mask_1/0
        kmovd    eax, k0                           # 122 [c=4 l=3]  *movsi_internal/16
        tzcnt    eax, eax                          # 107 [c=4 l=4]  tzcnt_si
        mov      eax, eax                          # 110 [c=4 l=2]  *zero_extendsidi2/3
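The `mov eax, eax` is a zero_extend from 32-bit to 64-bit (the function
returns size_t), and yes, it looks redundant: `tzcnt eax, eax` writes a
32-bit register, which on x86-64 already clears the upper 32 bits of rax,
and the tzcnt result itself is at most 32.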
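
To make the value-range argument concrete, here is a minimal portable sketch of what the quoted function computes. It is only a scalar model, not what GCC emits or how the intrinsics are implemented: `tzcnt32_model` and `search_byte_max32_model` are hypothetical names, and `len <= 32` is an assumption taken from the "max32" in the original function name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32: counts trailing zero bits and
   returns 32 when x == 0, so the result always lies in [0, 32]. */
static uint32_t tzcnt32_model(uint32_t x)
{
    uint32_t n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Hypothetical scalar model of avx512_search_byte_max32_2.
   Assumption: len <= 32 (inferred from the "max32" name). */
static size_t search_byte_max32_model(const unsigned char *data, size_t len,
                                      unsigned char key)
{
    uint32_t mask = 0;

    assert(len <= 32);

    /* Bit i of mask is set when data[i] == key, for i < len only.
       This mirrors the masked compare under k = _bzhi_u32(-1, len). */
    for (size_t i = 0; i < len; i++)
        if (data[i] == key)
            mask |= (uint32_t)1 << i;

    /* The 32-bit result is in [0, 32], so widening it to size_t
       cannot change its value; no separate zero-extension is needed. */
    return (size_t)tzcnt32_model(mask);
}
```

With this model, a miss returns 32, and every possible return value fits in 6 bits, which is why the extra 32-to-64-bit zero_extend carries no information.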