https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122074

--- Comment #5 from rockeet <rockeet at gmail dot com> ---
(In reply to Andrew Pinski from comment #4)
> > Suffix "_u" in __m256i_u emphasizes we are using an unaligned vector
> > which should be processed specially
> 
> No it does not mean that. It does mean it is unaligned.
> And gcc uses an unaligned load even:
>         vmovdqu ymm1, YMMWORD PTR [rdi]
> 
> And which is why at -O0, the loads are via bytes.
> 
> 
> Now there is a missed optimization of not fusing the load into the compare.

Fusing the load into the compare is excellent. I think GCC should also
fuse a masked load into the compare:
```
#include <immintrin.h>  // AVX-512BW + AVX-512VL intrinsics

typedef unsigned char byte_t;

size_t avx512_search_byte_max32(const byte_t* data, size_t len, byte_t key) {
  __mmask32 k = _bzhi_u32(-1u, len);               // mask of the low len bits (len <= 32)
  __m256i   d = _mm256_maskz_loadu_epi8(k, data);  // masked load: lanes beyond len are zeroed
  return _tzcnt_u32(_mm256_mask_cmpeq_epi8_mask(k, d, _mm256_set1_epi8(key)));
}
```
The mask load and the compare use the same mask register, so they should be fused.
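For clarity, here is a scalar sketch of what the intrinsic version computes (the name `scalar_search_byte_max32` and the `byte_t` typedef are illustrative, mirroring the snippet above; this is a reference model, not the AVX-512 code itself):

```c
#include <stddef.h>

typedef unsigned char byte_t;

/* Scalar equivalent: scan at most the first min(len, 32) bytes,
   matching the _bzhi_u32(-1, len) mask of the low len bits. */
static size_t scalar_search_byte_max32(const byte_t *data, size_t len, byte_t key) {
    for (size_t i = 0; i < len && i < 32; i++)
        if (data[i] == key)
            return i;  /* _tzcnt_u32 returns the index of the first set mask bit */
    return 32;         /* tzcnt of an all-zero 32-bit mask is 32 */
}
```

Because the compare is also masked by `k`, lanes at or beyond `len` can never report a match, so a miss always yields an all-zero mask and the function returns 32.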
