| Issue | 52866 |
| Summary | Missed horizontal reduction in armv8 |
| Labels | new issue |
| Assignees | |
| Reporter | uncleasm |

The following code was compiled with the `-O2` or `-O3` flag, using clang-13.0.0.0:
```
#include <cstdint>
#include <algorithm>
using veci = int32_t __attribute__((vector_size(16)));
int32_t maxv(veci a) {
    return std::max(std::max(a[0], a[1]), std::max(a[2], a[3]));
}
```
It compiles to
```
maxv(int __vector(4)): // @maxv(int __vector(4))
mov w8, v0.s[1]
fmov w11, s0
mov w9, v0.s[2]
mov w10, v0.s[3]
cmp w11, w8
csel w8, w8, w11, lt
cmp w9, w10
csel w9, w10, w9, lt
cmp w8, w9
csel w0, w9, w8, lt
ret
```
whereas it should compile to a single horizontal reduction:
```
maxv(int __vector(4)): // @maxv(int __vector(4))
smaxv s0, v0.4s
fmov w0, s0
ret
```
In contrast, the x64 backend (with `-msse4`) is able to perform the cross-lane comparison with shuffles, a technique that would be available on armv8 as well (with `b = vextq_s32(a,a,2); a = vmaxq_s32(a,b); b = vextq_s32(a,a,1); a = vmaxq_s32(a,b);`).
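For reference, the shuffle-based reduction can be written out by hand with the same GCC/Clang vector extensions as the test case. This is only a sketch of the technique: the names `lane_max` and `maxv_shuffle` are made up for illustration, and the lane rotations stand in for the `vextq_s32` steps rather than calling the NEON intrinsics directly.

```
#include <cstdint>

using veci = int32_t __attribute__((vector_size(16)));

// Lane-wise maximum built from a compare mask (hypothetical helper).
// The comparison yields all-ones in lanes where a > b and zero elsewhere.
static inline veci lane_max(veci a, veci b) {
    auto gt = a > b;
    return (a & gt) | (b & ~gt);
}

// Horizontal maximum in two shuffle + max steps (hypothetical name).
int32_t maxv_shuffle(veci a) {
    // Rotate by two lanes, analogous to vextq_s32(a, a, 2).
    veci b = {a[2], a[3], a[0], a[1]};
    a = lane_max(a, b);
    // Rotate by one lane, analogous to vextq_s32(a, a, 1).
    veci c = {a[1], a[2], a[3], a[0]};
    a = lane_max(a, c);
    // Every lane now holds the overall maximum.
    return a[0];
}
```

After the first step each lane holds the maximum of a pair of lanes two apart; the second step combines the two pairs, so two lane-wise max operations suffice for four lanes.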
Another case highlighting the missed vectorised comparison would be
```
#include <cmath>
using vecf = float __attribute__((vector_size(16)));
bool isfinite_ref(vecf a) {
    return std::isfinite(a[0]) &
           std::isfinite(a[1]) &
           std::isfinite(a[2]) &
           std::isfinite(a[3]);
}
```
which produces very verbose assembly on armv8 compared to intel-sse2, where the check is performed on all lanes in parallel.