Issue 180678
Summary [AArch64] vmlaq_laneq_u32 intrinsic is pessimized into dup/add
Labels new issue
Assignees
Reporter zeux
    When the `vmlaq_laneq_u32` intrinsic is used in a loop that computes a prefix sum, with the multiplier being a constant vector of 1s and 0s, LLVM attempts to strength-reduce the multiply-accumulate into per-component adds. That transformation produces more instructions that take more cycles to execute, both in real-world testing (on Apple Silicon) and according to llvm-mca models.

To reproduce, compile the attached file with `-O3`. It contains two identical functions, `prefix_sum_u32` and `prefix_sum_u32_bad`:
[prefixsum.c](https://github.com/user-attachments/files/25199990/prefixsum.c)

They differ only in that `prefix_sum_u32` "clobbers" the constants using inline assembly blocks, and `prefix_sum_u32_bad` doesn't. I would expect both to compile to `mla` instructions; instead, the "bad" version compiles to `dup`/`add` sequences, which defeats the purpose of the explicit optimization here.

Running llvm-mca with `-march=aarch64 -mcpu=apple-a19` on the loops produces the following estimates:

- `prefix_sum_u32`: 2.50 cycles/iter
- `prefix_sum_u32_bad`: 3.76 cycles/iter

In equivalent (but more complex) real-world code, I observe a slowdown on Apple M4 as well, so the regression is not just on the modeling side; and it does not appear that LLVM is honoring its own cost model here.

Godbolt link: https://godbolt.org/z/adMY18Yq9
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
