Issue 63833
Summary ARM: -O3 avoids post-index immediate offset instructions unnecessarily
Labels new issue
Assignees
Reporter johnstiles-google
Consider the following loop, which copies scalar data into vectors: [https://godbolt.org/z/E38feYWPd](https://godbolt.org/z/E38feYWPd)

Clang is generating addresses using add instructions, but this is unnecessary. It could use repeated post-index immediate offsets to march the pointer forward in memory. This is apparently safe and does not incur a performance penalty on Mac ARM CPUs. I am told it has a performance penalty only on the Cortex-A55, a CPU that has never been used in any Apple device. Even if it were slower, this would generate _smaller_ code, which is what -Oz is designed to do.

This approach would save two instructions:

        add     x8, x0, w1, uxtw
        add     x11, x0, x1, lsr #32
        ld1r    { v0.4s }, [x8], #4
        ld1r    { v1.4s }, [x8], #4
        ld1r    { v2.4s }, [x8], #4
        ld1r    { v3.4s }, [x8]
        stp     q0, q1, [x11]
        stp     q2, q3, [x11, #32]
        ret

For _even smaller_ code, Clang could even leverage `ld4r` to load all four scalars at once. In this case we have three fewer instructions, and wouldn't even need offsets at all.
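
For reference, here is a minimal sketch of what the `ld4r` variant might look like. It is hand-written, not compiler output, and it assumes the same register assignments as the listing above (`x0` as the base pointer, with the two 32-bit offsets packed into the low and high halves of `x1`):

```asm
        add     x8, x0, w1, uxtw                            // source address (assumed: low half of x1 is the source offset)
        add     x11, x0, x1, lsr #32                        // destination address (assumed: high half of x1 is the destination offset)
        ld4r    { v0.4s, v1.4s, v2.4s, v3.4s }, [x8]        // load four consecutive 32-bit scalars, splat scalar i across all lanes of v<i>
        stp     q0, q1, [x11]
        stp     q2, q3, [x11, #32]
        ret
```

That is six instructions against nine in the post-index version, and no pointer writeback is needed because the single load covers all four scalars.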
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
