[llvm-bugs] [Bug 166657] [AMDGPU] register spill instructions are generated inside control flow with `exec=0`

LLVM Bugs via llvm-bugs Wed, 05 Nov 2025 14:45:39 -0800

Issue	166657
Summary	[AMDGPU] register spill instructions are generated inside control flow with `exec=0`
Labels	new issue
Assignees
Reporter	raiseirql

Register spills generated by the backend can be scheduled in regions of code where `exec=0`, so the instructions are not executed. Kernels with these spills then crash or produce incorrect results.


Apologies for the long repro case: this is an attention kernel generated by the mojo compiler. I tried to create a simplified repro case but could not hit the condition.

In the repro, the kernel uses `readfirstlane` for the scalar buffer resource. The recently added `amdgpu-uniform-intrinsic-combine` pass correctly determines that these values are uniform and removes the `readfirstlane` intrinsics. But due to later instruction scheduling, `si-fix-sgpr-copies` generates code assuming the resource is not uniform, and a loop like this is generated (I'm opening a different issue for the `readfirstlane` problem):

```assembly
        s_cmp_lg_u32 s55, 0
        s_mov_b64 exec, s[38:39]
        s_cselect_b32 s55, 1, 0
        s_mov_b64 s[38:39], exec
.LBB0_17:                               ; Parent Loop BB0_7 Depth=1
                                        ; =>  This Inner Loop Header: Depth=2
        v_readfirstlane_b32 s4, v0
        v_readfirstlane_b32 s5, v1
        v_readfirstlane_b32 s6, v254
        v_readfirstlane_b32 s7, v255
        v_cmp_eq_u64_e32 vcc, s[4:5], v[0:1]
        s_nop 0
        v_cmp_eq_u64_e64 s[2:3], s[6:7], v[254:255]
        s_and_b64 s[2:3], vcc, s[2:3]
        s_and_saveexec_b64 s[2:3], s[2:3]
        buffer_load_dwordx4 v[2:5], v26, s[4:7], s48 offen
        s_xor_b64 exec, exec, s[2:3]
        s_cbranch_execnz .LBB0_17
; %bb.18:                               ;   in Loop: Header=BB0_7 Depth=1
        v_accvgpr_write_b32 a97, v13
        v_accvgpr_write_b32 a88, v10
        s_cmp_lg_u32 s55, 0
        s_mov_b64 exec, s[38:39]
        s_cselect_b32 s55, 1, 0
        s_mov_b64 s[38:39], exec
``` 

The problem here is that the `v_accvgpr_write_b32` instructions are in a section of code where the `exec` register is now zero, so these instructions are masked off and have no effect. If I edit the final assembly to move these instructions after the `s_mov_b64 exec, ...` that restores the `exec` mask, then the kernel runs fine. This pattern occurs multiple times in this example. On MI300 I have also observed cases where scratch instructions are generated in this region.
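To make the failure mode concrete, here is a toy Python model of the loop above. It is only a sketch, not the hardware or the compiler: `WAVE_SIZE`, `masked_write`, and `waterfall` are hypothetical names introduced for illustration, the wave is shrunk to 8 lanes, and the only behavior modeled is that a vector write updates just the lanes whose `exec` bit is set.

```python
# Toy model of per-lane EXEC masking. All names here are illustrative
# assumptions; this is not how the hardware or LLVM is implemented.
WAVE_SIZE = 8  # assumption: tiny wave for readability


def masked_write(dst, src, exec_mask):
    """Model a vector write: only lanes with their EXEC bit set are updated."""
    return [s if (exec_mask >> lane) & 1 else d
            for lane, (d, s) in enumerate(zip(dst, src))]


def waterfall(keys, exec_mask):
    """Model the readfirstlane loop: handle one group of equal keys per trip.

    Returns (number of trips, EXEC value when the loop exits).
    """
    trips = 0
    while exec_mask != 0:                                   # s_cbranch_execnz
        first = (exec_mask & -exec_mask).bit_length() - 1   # first active lane
        uniform = keys[first]                               # v_readfirstlane_b32
        match = 0
        for lane in range(WAVE_SIZE):                       # v_cmp_eq on active lanes
            if (exec_mask >> lane) & 1 and keys[lane] == uniform:
                match |= 1 << lane
        # Net effect of s_and_saveexec_b64 followed by s_xor_b64:
        # the lanes just handled are dropped from EXEC.
        exec_mask ^= match
        trips += 1
    return trips, exec_mask


saved_exec = 0b11111111                 # plays the role of s[38:39]
keys = [7] * WAVE_SIZE                  # a uniform resource descriptor
trips, exec_after_loop = waterfall(keys, saved_exec)

spill_slot = [0] * WAVE_SIZE
live_value = [10 + lane for lane in range(WAVE_SIZE)]

# Spill placed right after the loop, before EXEC is restored: nothing is written.
lost = masked_write(spill_slot, live_value, exec_after_loop)

# Same spill placed after restoring EXEC from the saved copy: all lanes written.
kept = masked_write(spill_slot, live_value, saved_exec)
```

In this model the loop always exits with `exec` equal to zero, regardless of how many distinct keys there are, so any masked write placed between the loop exit and the restore is silently dropped. That matches the hand edit described above, where moving the writes after the restore makes the kernel work.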

The excessive register usage here is due to this being an attention op for `head_size=256`. The kernel itself is not touching the `exec` register behind the back of the compiler. The expectation would be that the above transforms would generate slow but functional code.

_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
