https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123099
Bug ID: 123099
Summary: Compare reduction and predicate reduction patterns are missed
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: uis9936 at gmail dot com
Target Milestone: ---
While reading a paper about parsers, I noticed how long the dependency chain
for obtaining a bitmask is on ARM64, which led me to try to come up with a way
to hide the latency when no match is found. That is basically a form of
reduction. I then decided to look at how GCC's autovectorizer compares on a
generalised no-match reduction.
For that I made two kinds of functions: one that checks that none of the
elements are zero, and one that checks that at least one element is non-zero
(i.e. that not all elements are zero).
```
bool nozeros (__attribute__((vector_size(sizeof(unsigned int)*4))) unsigned int i) {
    auto t = i == 0; return !(t[0]||t[1]||t[2]||t[3]);
    //auto t = i; return t[0]&&t[1]&&t[2]&&t[3];
    //auto t = i != 0; return t[0]&&t[1]&&t[2]&&t[3];
}
bool anynonzero (__attribute__((vector_size(sizeof(unsigned int)*4))) unsigned int i) {
    //auto t = i != 0; return (t[0]||t[1]||t[2]||t[3]);
    auto t = i; return (t[0]||t[1]||t[2]||t[3]);
    //auto t = i == 0; return !(t[0]&&t[1]&&t[2]&&t[3]);
}
```
In each function I left uncommented the line that produced the intended
instructions on x86. Each commented-out line computes the same result as the
uncommented line in the same function.
On ARM, the generated code for all variants does the comparisons in scalar
registers, with a lot of instructions (and latency).
My idea for checking for non-zero values on ARM NEON is !!vmaxvq_u32(input).
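To make that idea concrete, here is a rough sketch of both checks written as horizontal reductions. The vmaxvq_u32 form is the one suggested above; using vminvq_u32 for the "no zeros" case is my own assumption of the natural counterpart, not something I have measured. The names v4u, anynonzero_hreduce, nozeros_hreduce and check_reductions are made up for illustration, and the non-AArch64 branch is only a scalar fallback so the snippet builds elsewhere.

```cpp
#include <cassert>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical shorthand for the vector type used in the functions above.
typedef unsigned int v4u __attribute__((vector_size(sizeof(unsigned int) * 4)));

#if defined(__aarch64__)
// Any lane non-zero <=> the unsigned maximum across lanes is non-zero.
static inline bool anynonzero_hreduce(v4u v) {
    return vmaxvq_u32((uint32x4_t)v) != 0;
}
// No lane zero <=> the unsigned minimum across lanes is non-zero.
// (Assumed counterpart of the vmaxvq_u32 idea.)
static inline bool nozeros_hreduce(v4u v) {
    return vminvq_u32((uint32x4_t)v) != 0;
}
#else
// Scalar fallback with the same semantics, for targets without these intrinsics.
static inline bool anynonzero_hreduce(v4u v) {
    return (v[0] | v[1] | v[2] | v[3]) != 0;
}
static inline bool nozeros_hreduce(v4u v) {
    unsigned int m = v[0];
    for (int k = 1; k < 4; ++k)
        if (v[k] < m) m = v[k];   // track the minimum lane
    return m != 0;
}
#endif

// Small usage check: the results should match the nozeros/anynonzero semantics.
static inline void check_reductions() {
    v4u all_zero = {0, 0, 0, 0};
    v4u one_set  = {0, 0, 7, 0};
    v4u all_set  = {1, 2, 3, 4};
    assert(!anynonzero_hreduce(all_zero) && !nozeros_hreduce(all_zero));
    assert( anynonzero_hreduce(one_set)  && !nozeros_hreduce(one_set));
    assert( anynonzero_hreduce(all_set)  &&  nozeros_hreduce(all_set));
}
```

On AArch64 I would expect each helper to come down to a single across-lanes reduction plus a move and compare, but I have not verified the exact output.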
Should this bug be split into target-independent and ARM-specific parts?