On Mon, Mar 2, 2026 at 5:36 PM Roger Sayle <[email protected]> wrote:
>
> Hi Hongtao,
> Many thanks for reviewing the x86_64 pieces.
>
> >        if (negate)
> > -       cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
> > -                                      CONST0_RTX (GET_MODE (cmp)),
> > -                                      NULL, NULL, &negate);
> > -
> > -      gcc_assert (!negate);
> > +       {
> > +         if (TARGET_AVX512F && GET_MODE_SIZE (GET_MODE (cmp)) >= 16)
> > +           cmp = gen_rtx_XOR (GET_MODE (cmp), cmp,
> > +                              CONSTM1_RTX (GET_MODE (cmp)));
> > +         else
> > +           {
> > +             cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
> > +                                            CONST0_RTX (GET_MODE (cmp)),
> > +                                            NULL, NULL, &negate);
> > +             gcc_assert (!negate);
> > +           }
> > +       }
> >
> > Technically it's correct; however, in actual scenarios avx512 (x86-64-v4)
> > will enter ix86_expand_mask_vec_cmp, so this optimization appears to only
> > target the scenario of avx512f + no-avx512vl + VL == 16/32, which doesn't
> > sound particularly useful.
>
> The flaw in this reasoning is the assumption that this function isn't
> entered in actual scenarios.  Consider:
>
> typedef char v16qi __attribute__((vector_size(16)));
> v16qi x, y, m;
> void foo() { m = x != y; }
>
> which when compiled with -O2 -mavx512vl on mainline currently generates:
>
> foo:    vmovdqa  x(%rip), %xmm0
>         vpxor    %xmm1, %xmm1, %xmm1
>         vpcmpeqb y(%rip), %xmm0, %xmm0
>         vpcmpeqb %xmm1, %xmm0, %xmm0
>         vmovdqa  %xmm0, m(%rip)
>         ret
>
> which uses vpxor and vpcmpeqb to invert the mask.
> With the proposed chunk above, we instead generate:
>
> foo:    vmovdqa  x(%rip), %xmm0
>         vpcmpeqb y(%rip), %xmm0, %xmm0
>         vpternlogd $0x55, %xmm0, %xmm0, %xmm0
>         vmovdqa  %xmm0, m(%rip)
>         ret
>
> Not only is this one instruction fewer, and shorter in bytes,
> but the not/xor/ternlog can be fused by combine with any
> following binary logic, whereas the vpcmpeqb against
> zero unfortunately can't (easily) be.
>
> The Bugzilla PR concerns x86_64 using vpcmpeqb to
> negate masks when it shouldn't be; the example above
> is exactly the sort of case it was complaining about.
> I was hoping the above not/xor/ternlog and a following
> blend or pand-pandn-por could eventually be fused into
> a single ternlog instruction, i.e. with ternlog the RTL
> optimizers (combine) can potentially swap operands of
> VCOND_MASK without requiring the middle-end's help.
I see, thanks for the explanation.
>
> Thanks (again) in advance,
> Roger
> --

--
BR,
Hongtao
