[llvm-bugs] [Bug 63833] ARM: -O3 avoids post-index immediate offset instructions unnecessarily

LLVM Bugs via llvm-bugs Wed, 12 Jul 2023 13:30:09 -0700

Issue	63833
Summary	ARM: -O3 avoids post-index immediate offset instructions unnecessarily
Labels	new issue
Assignees
Reporter	johnstiles-google

Consider the following loop, which copies scalar data into vectors: [https://godbolt.org/z/E38feYWPd](https://godbolt.org/z/E38feYWPd)


Clang generates the load addresses with `add` instructions, but this is unnecessary: it could instead use repeated post-index immediate offsets to march a single pointer forward through memory. As far as I can tell this is safe and carries no performance penalty on ARM-based Mac CPUs. I am told the post-index form is slower only on the Cortex-A55, a core that has never been used in any Apple device. And even where it is slower, it produces _smaller_ code, which is exactly what -Oz is meant to prioritize.

This approach would save two instructions:

    add     x8, x0, w1, uxtw
    add     x11, x0, x1, lsr #32
    ld1r    { v0.4s }, [x8], #4
    ld1r    { v1.4s }, [x8], #4
    ld1r    { v2.4s }, [x8], #4
    ld1r    { v3.4s }, [x8]
    stp     q0, q1, [x11]
    stp     q2, q3, [x11, #32]
    ret
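To make the post-index idea concrete, here is a small C sketch of what the four loads in the listing above do. It assumes the scalars are consecutive 32-bit values; `vec4`, `ld1r_post`, and `splat4_post_index` are hypothetical names standing in for a q register viewed as four `.4s` lanes and for the instruction semantics. It models the architectural behavior only, not anything Clang actually emits.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit NEON q register viewed as four 32-bit lanes (.4s). */
typedef struct {
    uint32_t lane[4];
} vec4;

/* Models `ld1r { vN.4s }, [p], #4`: replicate the 32-bit value at *p into all
 * four lanes, then write p + 4 bytes back to the base register (post-index). */
vec4 ld1r_post(const uint32_t **p) {
    vec4 v;
    for (int i = 0; i < 4; i++) {
        v.lane[i] = **p;
    }
    *p += 1; /* one uint32_t is 4 bytes, matching the #4 immediate */
    return v;
}

/* Models the four loads in the listing: a single base pointer is marched
 * forward by post-index writeback, so no per-load address arithmetic is needed.
 * (The listing's last ld1r omits the writeback because x8 is dead afterwards;
 * advancing the local copy here is harmless.) */
void splat4_post_index(const uint32_t *src, vec4 dst[4]) {
    const uint32_t *p = src;
    for (int r = 0; r < 4; r++) {
        dst[r] = ld1r_post(&p);
    }
}
```

Called on `{10, 20, 30, 40}`, `splat4_post_index` fills each `dst[r]` with four copies of `src[r]`, which is what `v0` through `v3` hold before the two `stp` stores.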

For _even smaller_ code, Clang could use `ld4r` to load all four scalars at once and replicate each one into its own vector register. That version would be three instructions shorter than the listing above, and would need no offsets at all.
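The `ld4r` suggestion can be checked with a small portable C model. It uses a hypothetical `vec4` type and a hypothetical `ld4r_model` function: one call reads a single four-element structure and replicates element `r` into every lane of vector `r`. On AArch64 the ACLE intrinsic `vld4q_dup_u32` from `<arm_neon.h>` corresponds to this instruction; the plain C below only mirrors its semantics so that it runs anywhere.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit NEON q register viewed as four 32-bit lanes (.4s). */
typedef struct {
    uint32_t lane[4];
} vec4;

/* Models `ld4r { v0.4s, v1.4s, v2.4s, v3.4s }, [p]`: read one four-element
 * structure (p[0..3]) and replicate element r into every lane of register r.
 * A single instruction replaces the four ld1r loads, and the base pointer is
 * never advanced, so no offsets or writeback are needed. */
void ld4r_model(const uint32_t *p, vec4 out[4]) {
    for (int r = 0; r < 4; r++) {
        for (int i = 0; i < 4; i++) {
            out[r].lane[i] = p[r];
        }
    }
}
```

Assuming the two `add` instructions, the two `stp` stores, and the `ret` stay as they are, the four `ld1r` lines would collapse into a single `ld4r { v0.4s, v1.4s, v2.4s, v3.4s }, [x8]`, for six instructions instead of nine.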

_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
