https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101639
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #11)
> (In reply to Hongtao Liu from comment #10)
> > clang generates
> >
> > avx512:
> > f(int*, long):
> > vmovdqu xmm0, xmmword ptr [rdi]
> > vptestnmd k0, xmm0, xmm0
> > kortestb k0, k0
> > sete al
> > ret
> >
> > avx2:
> > f(int*, long):
> > vpxor xmm0, xmm0, xmm0
> > vpcmpeqd xmm0, xmm0, xmmword ptr [rdi]
> > vmovmskps eax, xmm0
> > test eax, eax
> > sete al
> > ret
> >
> > Maybe GCC can reuse cstorem4 similar as cbranchm4 for those mask.
>
> Yes, I have not tried to implement native vector mask reduction, instead
> I'm going via a data bool vector for the epilogue to use tested code.
For XOR, cstorem4 isn't of help, but if we can get a scalar bit mask we
can use popcount&1 here. Targets with separate vector modes for masks
can use reduc_{and,ior,xor}_scal, but on x86, where masks have either
integer vector modes or integer scalar modes, that's going to be
difficult. A more explicit reduc_mask_{and,ior,xor}_scal would be better
there.
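
To illustrate the scalar fallback, here is a rough sketch of the three
reductions on a scalar bit mask with one bit per lane. The function
names and the nlanes parameter are made up for the example and are not
GCC optabs or internal functions. It assumes a 32-bit unsigned int, so
that __builtin_popcount covers the whole mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: keep only the low NLANES bits of MASK, i.e. one
   bit per vector lane.  */
static uint32_t
lane_bits (uint32_t mask, unsigned nlanes)
{
  assert (nlanes >= 1 && nlanes <= 32);
  return nlanes == 32 ? mask : mask & ((UINT32_C (1) << nlanes) - 1);
}

/* IOR reduction: true if any lane bit is set.  */
bool
mask_reduc_ior (uint32_t mask, unsigned nlanes)
{
  return lane_bits (mask, nlanes) != 0;
}

/* AND reduction: true if every lane bit is set.  */
bool
mask_reduc_and (uint32_t mask, unsigned nlanes)
{
  return lane_bits (mask, nlanes) == lane_bits (~UINT32_C (0), nlanes);
}

/* XOR reduction: true if an odd number of lane bits is set, which is
   the popcount&1 trick.  */
bool
mask_reduc_xor (uint32_t mask, unsigned nlanes)
{
  return __builtin_popcount (lane_bits (mask, nlanes)) & 1;
}
```

With a V4SI compare, mask would be the 4-bit result of something like
vmovmskps or a kmov from a k register, and nlanes would be 4.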