Issue 180678
Summary [AArch64] vmlaq_laneq_u32 intrinsic is pessimized into dup/add
Labels new issue
Assignees
Reporter zeux
    When the `vmlaq_laneq_u32` intrinsic is used in a loop that computes a prefix sum, with the multiplier being a constant vector of 1s and 0s, LLVM attempts to strength-reduce the multiply-accumulate into per-component adds. That transformation produces more instructions that take more cycles to execute, both in real-world testing (on Apple Silicon) and according to llvm-mca models.

To reproduce, compile the attached file with `-O3`. It contains two identical functions, `prefix_sum_u32` and `prefix_sum_u32_bad`:
[prefixsum.c](https://github.com/user-attachments/files/25199990/prefixsum.c)

They differ only in that `prefix_sum_u32` "clobbers" the constants using inline assembly blocks, and `prefix_sum_u32_bad` doesn't. I would expect both to compile to `mla` instructions; instead, the "bad" version compiles to `dup`/`add` sequences, which defeats the purpose of the explicit optimization here.

Running llvm-mca with `-march=aarch64 -mcpu=apple-a19` on the loops produces the following estimates:

- `prefix_sum_u32`: 2.50 cycles/iter
- `prefix_sum_u32_bad`: 3.76 cycles/iter

In equivalent (but more complex) real-world code, I observe a slowdown on Apple M4 as well, so the regression is not just on the modeling side; and it does not appear that LLVM is honoring its own cost model here.

Godbolt link: https://godbolt.org/z/adMY18Yq9
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
