https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123099

            Bug ID: 123099
           Summary: Compare reduction and predicate reduction patterns are
                    missed
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: uis9936 at gmail dot com
  Target Milestone: ---

While reading a paper about parsers, I noticed how long the dependency chain
for obtaining a bitmask is on ARM64. That led me to look for a way to hide the
latency when no match is found, which is essentially a form of reduction. I
then decided to see how the GCC autovectorizer handles a generalised no-match
reduction.

For that I wrote two kinds of functions: one that checks that none of the
elements are zero (nozeros), and one that checks whether any element is
non-zero, that is, that the elements are not all zero (anynonzero).

```
bool nozeros (__attribute__((vector_size(sizeof(unsigned int)*4))) unsigned int i) {
    auto t = i == 0; return !(t[0]||t[1]||t[2]||t[3]);
    //auto t = i; return t[0]&&t[1]&&t[2]&&t[3];
    //auto t = i != 0; return t[0]&&t[1]&&t[2]&&t[3];
}

bool anynonzero (__attribute__((vector_size(sizeof(unsigned int)*4))) unsigned int i) {
    //auto t = i != 0; return (t[0]||t[1]||t[2]||t[3]);
    auto t = i; return (t[0]||t[1]||t[2]||t[3]);
    //auto t = i == 0; return !(t[0]&&t[1]&&t[2]&&t[3]);
}
```

In each function, the line I left uncommented is the one that produced the
intended instructions on x86. Each commented-out line leads to the same result
as the uncommented line in the same function.
On ARM, the generated code for all variants does the comparisons in scalar
registers, with a lot of instructions (and latency).
My idea for checking for non-zero values with ARM NEON is !!vmaxvq_u32(input).
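
The NEON idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not output from GCC: the helper names `any_nonzero_u32x4` and `no_zeros_u32x4` are hypothetical, and using `vminvq_u32` for the no-zeros case is an assumption made by analogy with `vmaxvq_u32` (the minimum of unsigned lanes is non-zero only if every lane is non-zero). A scalar fallback is included so the snippet also builds on non-AArch64 targets.

```cpp
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: true if at least one of the four lanes is non-zero.
static bool any_nonzero_u32x4(const uint32_t v[4]) {
#if defined(__aarch64__)
    // Across-lanes unsigned maximum: non-zero iff some lane is non-zero.
    return vmaxvq_u32(vld1q_u32(v)) != 0;
#else
    // Scalar fallback for other targets.
    return (v[0] | v[1] | v[2] | v[3]) != 0;
#endif
}

// Hypothetical helper: true if none of the four lanes is zero.
static bool no_zeros_u32x4(const uint32_t v[4]) {
#if defined(__aarch64__)
    // Across-lanes unsigned minimum: non-zero iff every lane is non-zero
    // (assumed counterpart to the vmaxvq_u32 idea in the report).
    return vminvq_u32(vld1q_u32(v)) != 0;
#else
    // Scalar fallback for other targets.
    return v[0] && v[1] && v[2] && v[3];
#endif
}
```

Both helpers reduce the whole vector with a single across-lanes instruction and leave one scalar test at the end, which is the kind of code one would hope the reduction patterns in the report could be lowered to on AArch64.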

Should this bug be split into target-independent and ARM-specific parts?
