https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90483

            Bug ID: 90483
           Summary: input to ptest not optimized
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

The (V)PTEST instruction of SSE4.1/AVX produces ZF = `(a & b) == 0` and CF = `(~a & b) == 0`. Generic usage of PTEST simply sets `b = ~__m128i()` (or `~__m256i()`), i.e. tests `a` and `~a` for having only zero bits (cf. _mm_test_all_ones).

Consequently, if `a` is the result of a vector comparison which only depends on a bitmask, the compare instruction can be elided and the `~__m128i()` mask replaced with the corresponding bitmask.

Examples:

// test sign bit
bool bad(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x > 0x7f);
}

Since x > 0x7f can be rewritten as a test for the sign bit, we can optimize to (with 0x808080... at LC0):

  vptest .LC0(%rip), %xmm0
  sete %al
  ret

// test for zero
bool bad2(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x != 0);
}

This returns true only if no element is non-zero, which is equivalent to testing scalars for 0, i.e. we can optimize to:

  vptest %xmm0, %xmm0
  sete %al
  ret

// test for certain bits
bool bad3(__v16qu x, __v16qu k) {
  return __builtin_ia32_ptestz128(~__v16qu(), (x & k) != 0);
}

With the above transformation we already get PTEST(x&k, x&k), which can consequently be reduced to PTEST(x, k):

  vptest %xmm0, %xmm1
  sete %al
  ret

Further optimization of e.g. `(x & ~k) != 0` using CF instead of ZF might also be interesting. And of course, these transformations apply to all vector types, not just __v16qu.
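
The equivalences behind these transformations can be checked on a scalar model of PTEST. The following is only a sketch under stated assumptions: it does not use the real intrinsics or reflect anything inside GCC, `Vec` stands in for `__v16qu`, and every function name in it (`ptest_zf`, `ptest_cf`, `lane_mask`, `sign_before`, and so on) is hypothetical. It models a per-lane comparison as a 0xff/0x00 mask, feeds it to ZF against an all-ones operand, and compares that with the single-PTEST form. The comparisons modelled are `x > 0x7f`, `x != 0`, `(x & k) != 0` and `(x & ~k) != 0`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec = std::array<std::uint8_t, 16>;  // stand-in for __v16qu (assumption)

// Scalar model of PTEST's ZF: true when (a & b) has no bit set.
bool ptest_zf(const Vec& a, const Vec& b) {
    for (int i = 0; i < 16; ++i)
        if (a[i] & b[i]) return false;
    return true;
}

// Scalar model of PTEST's CF: true when (~a & b) has no bit set.
bool ptest_cf(const Vec& a, const Vec& b) {
    for (int i = 0; i < 16; ++i)
        if (static_cast<std::uint8_t>(~a[i]) & b[i]) return false;
    return true;
}

Vec splat(std::uint8_t v) { Vec r{}; r.fill(v); return r; }

Vec band(const Vec& a, const Vec& b) {
    Vec r{};
    for (int i = 0; i < 16; ++i) r[i] = static_cast<std::uint8_t>(a[i] & b[i]);
    return r;
}

Vec bnot(const Vec& a) {
    Vec r{};
    for (int i = 0; i < 16; ++i) r[i] = static_cast<std::uint8_t>(~a[i]);
    return r;
}

// Model of a lane-wise vector comparison: 0xff where pred holds, else 0x00.
template <class Pred>
Vec lane_mask(const Vec& x, Pred pred) {
    Vec r{};
    for (int i = 0; i < 16; ++i) r[i] = pred(x[i]) ? 0xff : 0x00;
    return r;
}

bool has_sign(std::uint8_t v) { return v > 0x7f; }
bool is_nonzero(std::uint8_t v) { return v != 0; }

// "before": compare, then PTEST against all-ones. "after": one PTEST.
bool sign_before(const Vec& x) { return ptest_zf(splat(0xff), lane_mask(x, has_sign)); }
bool sign_after(const Vec& x)  { return ptest_zf(splat(0x80), x); }

bool zero_before(const Vec& x) { return ptest_zf(splat(0xff), lane_mask(x, is_nonzero)); }
bool zero_after(const Vec& x)  { return ptest_zf(x, x); }

bool bits_before(const Vec& x, const Vec& k) {
    return ptest_zf(splat(0xff), lane_mask(band(x, k), is_nonzero));
}
bool bits_after(const Vec& x, const Vec& k) { return ptest_zf(x, k); }

// The CF variant: (x & ~k) has no bit set exactly when CF of PTEST(k, x) is set.
bool andn_before(const Vec& x, const Vec& k) {
    return ptest_zf(splat(0xff), lane_mask(band(x, bnot(k)), is_nonzero));
}
bool andn_after(const Vec& x, const Vec& k) { return ptest_cf(k, x); }

// Compare both forms on deterministic pseudo-random inputs. The bit patterns
// make the "true" outcomes reasonably frequent instead of vanishingly rare.
void self_check() {
    std::uint32_t state = 12345u;
    auto next = [&state]() -> std::uint8_t {
        state = state * 1664525u + 1013904223u;
        return static_cast<std::uint8_t>(state >> 24);
    };
    const std::uint8_t xpat[4] = {0x00, 0x7f, 0xff, 0x0f};
    const std::uint8_t kpat[4] = {0xff, 0xf0, 0x80, 0x00};
    for (int iter = 0; iter < 4096; ++iter) {
        Vec x{}, k{};
        for (int i = 0; i < 16; ++i) {
            x[i] = static_cast<std::uint8_t>(next() & xpat[iter % 4]);
            k[i] = static_cast<std::uint8_t>(next() & kpat[(iter / 4) % 4]);
        }
        assert(sign_before(x) == sign_after(x));
        assert(zero_before(x) == zero_after(x));
        assert(bits_before(x, k) == bits_after(x, k));
        assert(andn_before(x, k) == andn_after(x, k));
    }
}
```

Calling `self_check()` runs all four comparisons over a few thousand inputs. Agreement here only supports the claimed equivalences on this model; it says nothing about the code GCC actually emits.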