| Field | Value |
|---|---|
| Issue | 63833 |
| Summary | ARM: -O3 avoids post-index immediate offset instructions unnecessarily |
| Labels | new issue |
| Assignees | |
| Reporter | johnstiles-google |

Consider the following loop, which copies scalar data into vectors: [https://godbolt.org/z/E38feYWPd](https://godbolt.org/z/E38feYWPd)
Clang generates the addresses using `add` instructions, but this is unnecessary: it could use repeated post-index immediate offsets to march the pointer forward in memory. This is apparently safe and does not incur a performance penalty on Mac ARM CPUs. I am told it carries a performance penalty only on the Cortex-A55, a CPU that has never been used in any Apple device. Even if it were slower, it would generate _smaller_ code, which is what -Oz is designed to do.
This approach would save two instructions:

    add x8, x0, w1, uxtw
    add x11, x0, x1, lsr #32
    ld1r { v0.4s }, [x8], #4
    ld1r { v1.4s }, [x8], #4
    ld1r { v2.4s }, [x8], #4
    ld1r { v3.4s }, [x8]
    stp q0, q1, [x11]
    stp q2, q3, [x11, #32]
    ret

For _even smaller_ code, Clang could use `ld4r` to load all four scalars at once. That would take three fewer instructions and would not need offsets at all.
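
To make the `ld4r` idea concrete, here is a minimal C sketch of the same operation: four consecutive scalars are each broadcast into their own 4-lane vector. The helper name `splat4`, its pointer-based signature, and the `float` element type are assumptions for illustration (the `.4s` lanes could equally hold 32-bit integers); it is not the source behind the Godbolt link. On AArch64 it uses the `vld4q_dup_f32` intrinsic, which compilers typically lower to `ld4r`, though that lowering is not guaranteed. Elsewhere it falls back to a scalar model with the same results.

```c
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: read four consecutive floats at `src` and write
   four 4-lane vectors (16 floats) to `dst`, where vector i holds src[i]
   replicated into every lane. */
static void splat4(const float *src, float *dst) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* One load-and-replicate of four elements; expected (not guaranteed)
       to become a single ld4r { v0.4s, v1.4s, v2.4s, v3.4s }. */
    float32x4x4_t v = vld4q_dup_f32(src);
    vst1q_f32(dst + 0,  v.val[0]);
    vst1q_f32(dst + 4,  v.val[1]);
    vst1q_f32(dst + 8,  v.val[2]);
    vst1q_f32(dst + 12, v.val[3]);
#else
    /* Portable model: perform all four loads before any store, matching
       the load-then-store order of the assembly above. */
    float tmp[4];
    memcpy(tmp, src, sizeof tmp);
    for (int i = 0; i < 4; i++)
        for (int lane = 0; lane < 4; lane++)
            dst[4 * i + lane] = tmp[i];
#endif
}
```

Calling `splat4(src, dst)` with `src = {1, 2, 3, 4}` fills `dst` with four 1s, then four 2s, four 3s, and four 4s. Compiling the NEON path at -Oz would show whether the single-load form is actually emitted.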