https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90483
Bug ID: 90483
Summary: input to ptest not optimized
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: kretz at kde dot org
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
The (V)PTEST instruction of SSE4.1/AVX produces ZF = `(a & b) == 0` and CF =
`(~a & b) == 0`. Typical usage of PTEST simply sets `b = ~__m128i()` (or
`~__m256i()`), i.e. it tests whether `a` (via ZF) or `~a` (via CF) has only zero
bits (cf. _mm_test_all_ones).
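To make the flag semantics concrete, here is a minimal scalar model of PTEST in portable C++. The names `Vec`, `splat`, `testz`, `testc`, `all_zeros` and `all_ones` are hypothetical helpers for illustration only, not GCC builtins or Intel intrinsics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit vector register (not a GCC type).
using Vec = std::array<std::uint8_t, 16>;

// Hypothetical helper: broadcast one byte to all 16 lanes.
inline Vec splat(std::uint8_t v) {
    Vec r;
    r.fill(v);
    return r;
}

// Models the ZF result of PTEST a, b: true iff (a & b) has no bit set.
inline bool testz(const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((a[i] & b[i]) != 0) return false;
    return true;
}

// Models the CF result of PTEST a, b: true iff (~a & b) has no bit set.
inline bool testc(const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((static_cast<std::uint8_t>(~a[i]) & b[i]) != 0) return false;
    return true;
}

// With b = all-ones, ZF reports "a is all zeros" and CF reports "a is all ones".
inline bool all_zeros(const Vec& a) { return testz(a, splat(0xff)); }
inline bool all_ones(const Vec& a)  { return testc(a, splat(0xff)); }
```

In this model `all_ones` mirrors what _mm_test_all_ones computes: the CF of PTEST against an all-ones second operand.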
Consequently, if the tested operand is the result of a vector comparison whose
outcome depends only on certain bits of its input, the compare instruction can be
elided and the `~__m128i()` mask replaced with the corresponding bitmask.
Examples:
// test sign bit
bool bad(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x > 0x7f);
}
Since x > 0x7f can be rewritten as a test for the sign bit of each byte, we can
optimize to (with 0x808080... at .LC0):
  vptest .LC0(%rip), %xmm0
  sete %al
  ret
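The sign-bit rewrite can be checked with a scalar model. This is only a sketch: `Vec`, `splat`, `testz`, `cmp_gt` and the three functions below are hypothetical names that emulate the vector compare and the ZF result of PTEST byte by byte.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec = std::array<std::uint8_t, 16>;  // hypothetical stand-in for __v16qu

inline Vec splat(std::uint8_t v) { Vec r; r.fill(v); return r; }

// ZF of PTEST a, b: true iff (a & b) has no bit set.
inline bool testz(const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((a[i] & b[i]) != 0) return false;
    return true;
}

// Lane-wise unsigned x > c, producing 0xff or 0x00 per lane like a vector compare.
inline Vec cmp_gt(const Vec& x, std::uint8_t c) {
    Vec r;
    for (std::size_t i = 0; i < x.size(); ++i)
        r[i] = x[i] > c ? 0xff : 0x00;
    return r;
}

// As written in the report: compare first, then test the mask against all-ones.
inline bool sign_with_compare(const Vec& x) {
    return testz(splat(0xff), cmp_gt(x, 0x7f));
}

// Proposed form: test x directly against 0x80 in every byte.
inline bool sign_direct(const Vec& x) { return testz(splat(0x80), x); }

// Exhaustive check over every byte value in one lane (other lanes zero).
inline bool sign_forms_agree() {
    for (int v = 0; v < 256; ++v) {
        Vec x = splat(0x00);
        x[3] = static_cast<std::uint8_t>(v);
        if (sign_with_compare(x) != sign_direct(x)) return false;
    }
    return true;
}
```

Both forms return true exactly when no byte has its top bit set, which is why the compare is redundant here.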
// test for zero
bool bad2(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x != 0);
}
This is equivalent to testing a scalar for 0, i.e. we can optimize to:
  vptest %xmm0, %xmm0
  sete %al
  ret
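A scalar model helps pin down which mask test PTEST(x, x) reproduces. In the sketch below (all names are hypothetical helpers, not GCC builtins), the mask built from `x != 0` agrees with PTEST(x, x): ZF is set exactly when every element is zero. A mask built from `x == 0` would answer a different question, namely whether no element is zero.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec = std::array<std::uint8_t, 16>;  // hypothetical stand-in for __v16qu

inline Vec splat(std::uint8_t v) { Vec r; r.fill(v); return r; }

// ZF of PTEST a, b: true iff (a & b) has no bit set.
inline bool testz(const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((a[i] & b[i]) != 0) return false;
    return true;
}

// Lane-wise compare against zero; want_zero selects x == 0, otherwise x != 0.
inline Vec cmp_zero(const Vec& x, bool want_zero) {
    Vec r;
    for (std::size_t i = 0; i < x.size(); ++i)
        r[i] = ((x[i] == 0) == want_zero) ? 0xff : 0x00;
    return r;
}

// Mask from x != 0, tested against all-ones: true iff every lane is zero.
inline bool zero_with_compare(const Vec& x) {
    return testz(splat(0xff), cmp_zero(x, false));
}

// Mask from x == 0, tested against all-ones: true iff no lane is zero.
inline bool no_lane_is_zero(const Vec& x) {
    return testz(splat(0xff), cmp_zero(x, true));
}

// Proposed form without the compare: PTEST(x, x).
inline bool zero_direct(const Vec& x) { return testz(x, x); }

// Exhaustive check over every byte value in one lane (other lanes zero).
inline bool zero_forms_agree() {
    for (int v = 0; v < 256; ++v) {
        Vec x = splat(0x00);
        x[3] = static_cast<std::uint8_t>(v);
        if (zero_with_compare(x) != zero_direct(x)) return false;
    }
    return true;
}
```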
// test for certain bits
bool bad3(__v16qu x, __v16qu k) {
  return __builtin_ia32_ptestz128(~__v16qu(), (x & k) != 0);
}
With the above transformation we already get PTEST(x&k, x&k), which can
consequently be reduced to PTEST(x, k):
  vptest %xmm0, %xmm1
  sete %al
  ret
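The chain from the mask test to PTEST(x&k, x&k) to PTEST(x, k) can also be checked with a scalar model. This is a sketch with hypothetical helper names. It assumes the mask marks lanes where `(x & k) != 0`, so that the ZF result means "no selected bit is set anywhere".

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec = std::array<std::uint8_t, 16>;  // hypothetical stand-in for __v16qu

inline Vec splat(std::uint8_t v) { Vec r; r.fill(v); return r; }

// ZF of PTEST a, b: true iff (a & b) has no bit set.
inline bool testz(const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((a[i] & b[i]) != 0) return false;
    return true;
}

// Lane-wise (x & k) != 0, producing 0xff or 0x00 per lane.
inline Vec and_ne_zero(const Vec& x, const Vec& k) {
    Vec r;
    for (std::size_t i = 0; i < x.size(); ++i)
        r[i] = (x[i] & k[i]) != 0 ? 0xff : 0x00;
    return r;
}

// Compare first, then test the mask against all-ones.
inline bool bits_with_compare(const Vec& x, const Vec& k) {
    return testz(splat(0xff), and_ne_zero(x, k));
}

// Proposed form: PTEST(x, k) computes the AND itself.
inline bool bits_direct(const Vec& x, const Vec& k) { return testz(x, k); }

// Exhaustive check over all byte pairs in one lane (other lanes: x = 0).
inline bool bit_forms_agree() {
    for (int xv = 0; xv < 256; ++xv)
        for (int kv = 0; kv < 256; ++kv) {
            Vec x = splat(0x00), k = splat(0xff);
            x[7] = static_cast<std::uint8_t>(xv);
            k[7] = static_cast<std::uint8_t>(kv);
            if (bits_with_compare(x, k) != bits_direct(x, k)) return false;
        }
    return true;
}
```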
Further optimization of e.g. `(x & ~k) == 0` using CF instead of ZF might also
be interesting.
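One possible reading of the CF idea, sketched as a scalar model with hypothetical helper names: the condition "every element of `x & ~k` is zero" matches the CF result of PTEST with `a = k` and `b = x`, since CF = `(~a & b) == 0`. The operand order here is an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec = std::array<std::uint8_t, 16>;  // hypothetical stand-in for __v16qu

inline Vec splat(std::uint8_t v) { Vec r; r.fill(v); return r; }

// ZF of PTEST a, b: true iff (a & b) has no bit set.
inline bool testz(const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((a[i] & b[i]) != 0) return false;
    return true;
}

// CF of PTEST a, b: true iff (~a & b) has no bit set.
inline bool testc(const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((static_cast<std::uint8_t>(~a[i]) & b[i]) != 0) return false;
    return true;
}

// Lane-wise (x & ~k) != 0, producing 0xff or 0x00 per lane.
inline Vec andnot_ne_zero(const Vec& x, const Vec& k) {
    Vec r;
    for (std::size_t i = 0; i < x.size(); ++i)
        r[i] = (x[i] & static_cast<std::uint8_t>(~k[i])) != 0 ? 0xff : 0x00;
    return r;
}

// Compare first, then test the mask against all-ones using ZF.
inline bool andnot_with_compare(const Vec& x, const Vec& k) {
    return testz(splat(0xff), andnot_ne_zero(x, k));
}

// Candidate form using CF: PTEST with a = k, b = x.
inline bool andnot_direct(const Vec& x, const Vec& k) { return testc(k, x); }

// Exhaustive check over all byte pairs in one lane (other lanes: x = 0).
inline bool andnot_forms_agree() {
    for (int xv = 0; xv < 256; ++xv)
        for (int kv = 0; kv < 256; ++kv) {
            Vec x = splat(0x00), k = splat(0x00);
            x[7] = static_cast<std::uint8_t>(xv);
            k[7] = static_cast<std::uint8_t>(kv);
            if (andnot_with_compare(x, k) != andnot_direct(x, k)) return false;
        }
    return true;
}
```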
And of course, these transformations apply to all vector types, not just
__v16qu.