| Field | Value |
| --- | --- |
| Issue | 52868 |
| Summary | Inefficient code generated for vmull_high_p8 in complex loops |
| Labels | new issue |
| Assignees | |
| Reporter | uncleasm |

All but the most trivial uses of `vmull_high_p8` seem to produce unnecessarily bloated code: the compiler makes an unnecessary copy of the high half of a NEON register instead of using `pmull2` directly:
```
#include "arm_neon.h"
// Set to 0 to remove the loop (the less complex case).
#define ENABLE_ISSUE 1

// Polynomial multiply of the low halves, the counterpart of vmull_high_p8.
inline poly16x8_t vmull_low_p8(poly8x16_t a, poly8x16_t b) {
    return vmull_p8(vget_low_p8(a), vget_low_p8(b));
}

poly16x8x2_t p(const poly8_t *input, int len, poly8x16_t x, poly8x16_t X) {
    auto ptr = input + len;
    auto L = vdupq_n_p16(*--ptr), H = L;
#if ENABLE_ISSUE
    while (ptr > input)
#endif
    {
        // Split the 16-bit lanes of L and H into even and odd bytes.
        auto s = vuzpq_p8(vreinterpretq_p8_p16(L), vreinterpretq_p8_p16(H));
        auto a = vmull_low_p8(s.val[0], x);
        auto b = vmull_high_p8(s.val[0], x);  // expected to become pmull2
        auto A = vmull_low_p8(s.val[1], X);
        auto B = vmull_high_p8(s.val[1], X);  // expected to become pmull2
        auto C = vdupq_n_p16(*--ptr);
        L = C ^ a ^ A;
        H = C ^ b ^ B;
    }
    return {L, H};
}
```
When the issue is enabled, the following code is generated:
```
...
ext v3.16b, v6.16b, v6.16b, #8
ext v7.16b, v2.16b, v2.16b, #8
pmull v6.8h, v6.8b, v0.8b
pmull v2.8h, v2.8b, v1.8b
pmull v3.8h, v3.8b, v4.8b
pmull v7.8h, v7.8b, v5.8b
...
```
Instead, one would expect to have
```
...
pmull v6.8h, v6.8b, v0.8b
pmull v2.8h, v2.8b, v1.8b
pmull2 v3.8h, v6.16b, v4.16b
pmull2 v7.8h, v2.16b, v5.16b
```
just like in the less complex case without the loop.
The same issue was recently seen with regular arithmetic using `vmull_high_u8`, but that has been resolved in clang trunk.
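To illustrate why the `ext` copies are redundant, here is a minimal scalar sketch in portable C++. It does not use `arm_neon.h`. The names `clmul8`, `model_pmull`, `model_pmull2`, `model_ext8` and `ext_then_pmull` are hypothetical helpers invented for this illustration, and they only model the lane semantics as I understand them: `pmull` multiplies lanes 0..7, `pmull2` multiplies lanes 8..15, and `ext #8` of a register with itself swaps the two halves. It is not meant as a statement about how the compiler should implement the fix.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16x8 = std::array<std::uint8_t, 16>;   // stands in for poly8x16_t
using Vec8x16 = std::array<std::uint16_t, 8>;   // stands in for poly16x8_t

// Carry-less (polynomial over GF(2)) multiply of two 8-bit values.
// Models a single lane of pmull/pmull2.
std::uint16_t clmul8(std::uint8_t a, std::uint8_t b) {
    std::uint16_t r = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & (1u << i)) {
            r ^= static_cast<std::uint16_t>(a << i);
        }
    }
    return r;
}

// Model of "pmull vd.8h, vn.8b, vm.8b": multiplies lanes 0..7.
Vec8x16 model_pmull(const Vec16x8& a, const Vec16x8& b) {
    Vec8x16 r{};
    for (std::size_t i = 0; i < 8; ++i) r[i] = clmul8(a[i], b[i]);
    return r;
}

// Model of "pmull2 vd.8h, vn.16b, vm.16b": multiplies lanes 8..15.
Vec8x16 model_pmull2(const Vec16x8& a, const Vec16x8& b) {
    Vec8x16 r{};
    for (std::size_t i = 0; i < 8; ++i) r[i] = clmul8(a[i + 8], b[i + 8]);
    return r;
}

// Model of "ext vd.16b, vn.16b, vn.16b, #8": swaps the low and high halves.
Vec16x8 model_ext8(const Vec16x8& a) {
    Vec16x8 r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = a[(i + 8) % 16];
    return r;
}

// The sequence currently generated in the loop: copy the high half down
// with ext, then use the low-half multiply.
Vec8x16 ext_then_pmull(const Vec16x8& a, const Vec16x8& b) {
    return model_pmull(model_ext8(a), model_ext8(b));
}
```

Under this model, `ext_then_pmull(a, b)` and `model_pmull2(a, b)` return the same lanes for any inputs, which is why a single `pmull2` per high-half product would be expected in place of each `ext` + `pmull` pair.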