[llvm-bugs] [Bug 166808] [AArch64] Excessive vectorization of scalar rotates results in poor performance

LLVM Bugs via llvm-bugs Thu, 06 Nov 2025 09:18:18 -0800

Issue	166808
Summary	[AArch64] Excessive vectorization of scalar rotates results in poor performance
Labels	new issue
Assignees
Reporter	zeux

Compiling the attached file, [test.cpp](https://github.com/user-attachments/files/23398116/test.cpp), for AArch64 with `-O3` results in suboptimal code generation. Specifically, the tail of the loop in the source is written to move the vector results into scalar registers and rotate them there (`rotateleft64` is a wrapper around `__builtin_rotateleft64`):


		// store results to stack so that we can rotate using scalar instructions
		uint64_t res[4];
		vst1q_u64((uint64_t*)&res[0], res_0);
		vst1q_u64((uint64_t*)&res[2], res_1);

		// rotate and store
		uint64_t* out = reinterpret_cast<uint64_t*>(&data[i * 4]);

		out[0] = rotateleft64(res[0], data[(i + 0) * 4 + 3] << 4);
		out[1] = rotateleft64(res[1], data[(i + 1) * 4 + 3] << 4);
		out[2] = rotateleft64(res[2], data[(i + 2) * 4 + 3] << 4);
		out[3] = rotateleft64(res[3], data[(i + 3) * 4 + 3] << 4);

... however, clang vectorizes this code instead. Because A64 doesn't have vector forms of rotates, the vectorization is quite unprofitable: it needs to synthesize the rotates out of shifts, ands, adds, and ors, and it also has to get the shift mask into the right registers. For comparison, when `res` is marked `volatile`, the compiler uses vector stack stores and GPR loads, and the loop body has 60 instructions; without `volatile`, the vectorized loop body has 70 instructions. Running the code confirms that this is a pessimization: on a Graviton 3 CPU, the `volatile` variant runs at ~5 GB/s and the variant without `volatile` runs at ~3.6 GB/s, a drop of roughly 28%. (Note: the size of the delta is likely due at least in part to the fact that scalar ops can run on ports that are otherwise free while the next iteration runs vector instructions.)
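For readers without the attachment, here is a minimal sketch of what "synthesizing a rotate" means. The real `rotateleft64` lives in test.cpp and is not reproduced here, so the wrapper below is only an assumed shape of it, and `rotateleft64_shifts` is a hypothetical name for the shift/and/or form. This is the scalar equivalent of the operation sequence, not the exact code clang emits.

```cpp
#include <cstdint>

// Some compilers do not provide __has_builtin; treat every builtin as missing there.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Hypothetical helper: rotate left built only from shifts, ands and an or.
// Both shift counts are reduced modulo 64, so n == 0 and n >= 64 are well defined.
inline uint64_t rotateleft64_shifts(uint64_t v, uint64_t n)
{
	const uint64_t left = n & 63;        // and: count modulo 64
	const uint64_t right = (0 - n) & 63; // negate + and: complementary count
	return (v << left) | (v >> right);   // two shifts + or
}

// Assumed shape of the report's wrapper: use the builtin where the compiler
// provides it, otherwise fall back to the shift-based form above.
inline uint64_t rotateleft64(uint64_t v, uint64_t n)
{
#if __has_builtin(__builtin_rotateleft64)
	return __builtin_rotateleft64(v, n);
#else
	return rotateleft64_shifts(v, n);
#endif
}
```

On a scalar register the whole rotate can be a single instruction, while the shift-based form costs several operations per rotate. That is the overhead the vectorized loop pays for every lane, on top of moving the rotate counts into vector registers.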

It would be nice if clang... didn't do this. gcc doesn't, and gets 5 GB/s on Graviton 3 without `volatile`.

Godbolt link for easy testing: https://gcc.godbolt.org/z/azPoGh1ns (uncomment volatile on line 70 to compare codegen)

_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
