Issue 52868
Summary Inefficient code generated for vmull_high_p8 in complex loops
Labels new issue
Assignees
Reporter uncleasm
All but the most trivial uses of `vmull_high_p8` seem to produce unnecessarily bloated code, making an unnecessary copy of the high half of a NEON register:

```
#include "arm_neon.h"
#define ENABLE_ISSUE 1
inline poly16x8_t vmull_low_p8(poly8x16_t a, poly8x16_t b) {
    return vmull_p8(vget_low_p8(a), vget_low_p8(b));
}

poly16x8x2_t p(const poly8_t *input, int len, poly8x16_t x, poly8x16_t X) {
    auto ptr = input + len;
    auto L = vdupq_n_p16(*--ptr), H = L;
#if ENABLE_ISSUE
    while (ptr > input)
#endif 
    {
        auto s = vuzpq_p8(vreinterpretq_p8_p16(L), vreinterpretq_p8_p16(H));
        auto a = vmull_low_p8(s.val[0], x);
        auto b = vmull_high_p8(s.val[0], x);
        auto A = vmull_low_p8(s.val[1], X);
        auto B = vmull_high_p8(s.val[1], X);
        auto C = vdupq_n_p16(*--ptr);
        L = C ^ a ^ A;
        H = C ^ b ^ B;
    }
    return {L,H};
}
```

When the issue is enabled (`ENABLE_ISSUE` set to 1, so the loop is present), the following code is generated:

```
        ...
        ext     v3.16b, v6.16b, v6.16b, #8
        ext     v7.16b, v2.16b, v2.16b, #8
        pmull   v6.8h, v6.8b, v0.8b
        pmull   v2.8h, v2.8b, v1.8b
        pmull   v3.8h, v3.8b, v4.8b
        pmull   v7.8h, v7.8b, v5.8b
        ...
```

Instead, one would expect to have
```
        ...
        pmull   v6.8h, v6.8b, v0.8b
        pmull   v2.8h, v2.8b, v1.8b
        pmull2  v3.8h, v6.16b, v4.16b
        pmull2  v7.8h, v2.16b, v5.16b
```

just like in the less complex case without the loop.

A similar issue was recently seen with regular arithmetic (`vmull_high_u8`), but that case has been resolved in clang trunk.
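To make the expectation concrete, here is a small portable scalar sketch (not NEON code, and not taken from the compiler) of why the `ext` copies should be redundant: `pmull2` reads the high eight lanes of its inputs directly, so it should give the same result as rotating the registers with `ext ..., #8` and then using `pmull`. All names below (`clmul8`, `model_pmull`, `model_pmull2`, `model_ext8`, `self_check`) are hypothetical helpers introduced only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Carry-less (polynomial over GF(2)) multiply of two 8-bit values,
// i.e. what one lane of pmull/pmull2 computes.
constexpr std::uint16_t clmul8(std::uint8_t a, std::uint8_t b) {
    std::uint16_t r = 0;
    for (int i = 0; i < 8; ++i)
        if (b & (1u << i))
            r = static_cast<std::uint16_t>(r ^ (a << i));
    return r;
}

using Vec16x8 = std::array<std::uint8_t, 16>;  // stands in for poly8x16_t
using Vec8x16 = std::array<std::uint16_t, 8>;  // stands in for poly16x8_t

// Models pmull on the low halves (lanes 0..7), as in vmull_low_p8 above.
inline Vec8x16 model_pmull(const Vec16x8& a, const Vec16x8& b) {
    Vec8x16 r{};
    for (int i = 0; i < 8; ++i) r[i] = clmul8(a[i], b[i]);
    return r;
}

// Models pmull2 (vmull_high_p8): reads lanes 8..15 straight from the
// full 128-bit inputs, with no separate extraction step.
inline Vec8x16 model_pmull2(const Vec16x8& a, const Vec16x8& b) {
    Vec8x16 r{};
    for (int i = 0; i < 8; ++i) r[i] = clmul8(a[i + 8], b[i + 8]);
    return r;
}

// Models "ext v.16b, v.16b, v.16b, #8": rotate the register by 8 bytes,
// which moves the high half into the low half.
inline Vec16x8 model_ext8(const Vec16x8& v) {
    Vec16x8 r{};
    for (int i = 0; i < 16; ++i) r[i] = v[(i + 8) % 16];
    return r;
}

// Checks that ext+pmull and pmull2 agree on arbitrary sample inputs.
inline void self_check() {
    Vec16x8 a{}, b{};
    for (int i = 0; i < 16; ++i) {
        a[i] = static_cast<std::uint8_t>(i * 17 + 3);
        b[i] = static_cast<std::uint8_t>(i * 29 + 7);
    }
    assert(model_pmull(model_ext8(a), model_ext8(b)) == model_pmull2(a, b));
}
```

Under this model the `ext` + `pmull` pair and a single `pmull2` are interchangeable, which is why the extra copies in the loop version look avoidable.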

_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
