[llvm-bugs] [Bug 166190] [clang] Incorrect codegen for some `_lane_` intrinsics with O2 on big-endian aarch64

LLVM Bugs via llvm-bugs Mon, 03 Nov 2025 08:44:44 -0800

Issue	166190
Summary	[clang] Incorrect codegen for some `_lane_` intrinsics with O2 on big-endian aarch64
Labels	clang
Assignees
Reporter	CrooseGit

    This definitely affects `vdot_lane_s32`, but I expect the same problem affects many other `_lane_` intrinsics, which are listed at the end.


# Instructions for reproduction:
## Code:
```cpp
#include <iostream>
#include <string>
#include <cstring>
#include <iomanip>
#include <sstream>
#include <type_traits>
#include <cassert>
#include <arm_neon.h>
#include <arm_acle.h>
#include <arm_fp16.h>

#ifdef __aarch64__
std::ostream& operator<<(std::ostream& os, poly128_t value);
#endif

std::ostream& operator<<(std::ostream& os, float16_t value);
std::ostream& operator<<(std::ostream& os, uint8_t value);



int run_vdot_lane_s32() {
    alignas(64) const int32_t r_val_vals[] = {
        0x0,
        0x800000,
        0x3effffff,
        0x3f000000,
        0x3f000001,
        0x3f7fffff,
        0x3f800000,
        0x3f800001,
        0x3fc00000,
        0x41200000,
        0x7f8fffff,
        0x7f800000,
        0x7fd23456,
        0x7fc00000,
        0x7f923456,
        0x7f800001,
        0x123456,
        0x7fffff,
        0x1,
        -0x7fffffff - 1,
        -0x7f800000
    };
    alignas(64) const int8_t a_val_vals[] = {
        0x0,
        0x1,
        0x2,
        0x3,
        0x4,
        0x5,
        0x6,
        0x7,
        0x8,
        0x9,
        0xa,
        0xb,
        0xc,
        0xd,
        0xe,
        0xf,
        -0x10,
        -0x7f - 1,
        0x3b,
        -0x1,
        0x0,
        0x1,
        0x2,
        0x3,
        0x4,
        0x5,
        0x6
    };
    alignas(64) const int8_t b_val_vals[] = {
        0x0,
        0x1,
        0x2,
        0x3,
        0x4,
        0x5,
        0x6,
        0x7,
        0x8,
        0x9,
        0xa,
        0xb,
        0xc,
        0xd,
        0xe,
        0xf,
        -0x10,
        -0x7f - 1,
        0x3b,
        -0x1,
        0x0,
        0x1,
        0x2,
        0x3,
        0x4,
        0x5,
        0x6
    };
    const int32_t lane_val = (const int32_t)0;
    for (int i=0; i<20; i++) {
        int32x2_t r_val = vld1_s32(&r_val_vals[i]);
        int8x8_t a_val = vld1_s8(&a_val_vals[i]);
        int8x8_t b_val = vld1_s8(&b_val_vals[i]);
        auto __return_value = vdot_lane_s32(r_val, a_val, b_val, lane_val);
        std::cout << "Result -0-" << i+1 << ": int32x2_t(" << std::fixed << std::setprecision(150) <<  vget_lane_s32(__return_value, 0) << ", " << vget_lane_s32(__return_value, 1) << ")" << std::endl;
    }
    return 0;
}

int main(){
  run_vdot_lane_s32();
}
```
(This may not look like a minimal reproduction, but cutting it down any further lets the optimiser avoid the issue. The code is generated by the Rust test suite.)

## Then
Compile with clang for `aarch64_be-unknown-linux-gnu` at `-O2`, then run the resulting binary.

# Output
Following the steps above produces this erroneous output:
```
Result -0-1: int32x2_t(4, 8388636)
Result -0-2: int32x2_t(8388628, 1056964667)
Result -0-3: int32x2_t(1056964651, 1056964708)
Result -0-4: int32x2_t(1056964684, 1056964757)
[...]
```
Repeating this without `-O2` gives the correct output:
```
Result -0-1: int32x2_t(14, 8388646)
Result -0-2: int32x2_t(8388638, 1056964677)
Result -0-3: int32x2_t(1056964661, 1056964718)
Result -0-4: int32x2_t(1056964694, 1056964767)
Result -0-5: int32x2_t(1056964735, 1065353429)
[...]
```
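The correct values can be cross-checked without any NEON hardware. The following is a minimal scalar sketch of what `vdot_lane_s32` computes: each 32-bit result lane accumulates the dot product of four bytes of `a` with the four bytes of `b` selected by `lane`. The function name `ref_vdot_lane_s32` and the use of `std::array` are assumptions made for illustration; they are not part of the report or of `arm_neon.h`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of vdot_lane_s32 (illustration only).
// r:    two 32-bit accumulators
// a, b: eight signed bytes each
// lane: which group of four consecutive bytes of b to use (0 or 1)
std::array<int32_t, 2> ref_vdot_lane_s32(std::array<int32_t, 2> r,
                                         const std::array<int8_t, 8>& a,
                                         const std::array<int8_t, 8>& b,
                                         int lane) {
    assert(lane >= 0 && lane < 2);
    for (int i = 0; i < 2; ++i) {
        int32_t sum = 0;
        for (int j = 0; j < 4; ++j) {
            // Each product fits easily in 32 bits: |a * b| <= 128 * 128.
            sum += int32_t(a[4 * i + j]) * int32_t(b[4 * lane + j]);
        }
        // Accumulate in unsigned arithmetic so overflow wraps instead of
        // being undefined behaviour in this model.
        r[i] = int32_t(uint32_t(r[i]) + uint32_t(sum));
    }
    return r;
}
```

For the first loop iteration (`r = {0x0, 0x800000}`, `a = b = {0, 1, ..., 7}`, lane 0), this model returns `(14, 8388646)`, which matches the first line of the output without `-O2`.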
# Incorrect codegen:
As far as I can tell, the codegen problem is the `rev32` instruction emitted before `sdot`, the instruction underlying the intrinsic:
```asm
ld1	{ v0.8b }, [x8]
mov	w2, #10                         // =0xa
rev32	v1.16b, v0.16b
sdot	v2.2s, v0.8b, v1.4b[0]
```
Removing this `rev32` (and changing the final `sdot` operand from `v1` to `v0`), then assembling the result, seems to fix the issue:
```asm
ld1	{ v0.8b }, [x8]
mov	w2, #10 // =0xa
sdot	v2.2s, v0.8b, v0.4b[0]
```
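The erroneous values are numerically consistent with the four bytes of the selected `b` lane being reversed before the dot product, which is the effect `rev32` has on byte elements. The sketch below checks this on the host with plain scalar code. The helper names `reverse_bytes_per_word` and `dot_lane` are hypothetical, and the model only covers the arithmetic; it does not attempt to reproduce the big-endian register layout.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: reverse the bytes inside each 32-bit group of an
// 8-byte vector, mirroring what rev32 does to byte elements.
std::array<int8_t, 8> reverse_bytes_per_word(const std::array<int8_t, 8>& v) {
    std::array<int8_t, 8> out{};
    for (int word = 0; word < 2; ++word)
        for (int j = 0; j < 4; ++j)
            out[4 * word + j] = v[4 * word + (3 - j)];
    return out;
}

// Hypothetical scalar dot-by-lane: each result lane accumulates four bytes
// of a multiplied by the four bytes of b selected by `lane`.
std::array<int32_t, 2> dot_lane(std::array<int32_t, 2> r,
                                const std::array<int8_t, 8>& a,
                                const std::array<int8_t, 8>& b,
                                int lane) {
    assert(lane >= 0 && lane < 2);
    for (int i = 0; i < 2; ++i) {
        int32_t sum = 0;
        for (int j = 0; j < 4; ++j)
            sum += int32_t(a[4 * i + j]) * int32_t(b[4 * lane + j]);
        // Unsigned addition so that overflow wraps in this model.
        r[i] = int32_t(uint32_t(r[i]) + uint32_t(sum));
    }
    return r;
}
```

With the inputs of the first loop iteration (`r = {0x0, 0x800000}`, `a = b = {0, 1, ..., 7}`), `dot_lane(r, a, reverse_bytes_per_word(b), 0)` yields `(4, 8388636)`, the first erroneous line, while `dot_lane(r, a, b, 0)` yields the correct `(14, 8388646)`. The second iteration behaves the same way: `(8388628, 1056964667)` with the reversal and `(8388638, 1056964677)` without it.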

# Other intrinsics suspected to be faulty at `-O2` on big-endian
```
vcmla_lane_f16
vcmla_laneq_f16
vcmla_rot180_lane_f16
vcmla_rot180_laneq_f16
vcmla_rot270_lane_f16
vcmla_rot270_laneq_f16
vcmla_rot90_lane_f16
vcmla_rot90_laneq_f16
vcmlaq_lane_f16
vcmlaq_laneq_f16
vcmlaq_laneq_f32
vcmlaq_rot180_lane_f16
vcmlaq_rot180_laneq_f16
vcmlaq_rot180_laneq_f32
vcmlaq_rot270_lane_f16
vcmlaq_rot270_laneq_f16
vcmlaq_rot270_laneq_f32
vcmlaq_rot90_lane_f16
vcmlaq_rot90_laneq_f16
vcmlaq_rot90_laneq_f32
vdot_lane_s32
vdot_lane_u32
vdot_laneq_s32
vdot_laneq_u32
vdotq_lane_s32
vdotq_lane_u32
vdotq_laneq_s32
vdotq_laneq_u32
vsudot_lane_s32
vsudot_laneq_s32
vsudotq_lane_s32
vsudotq_laneq_s32
vusdot_lane_s32
vusdot_laneq_s32
vusdotq_lane_s32
vusdotq_laneq_s32
```
