| Field | Value |
|---|---|
| Issue | 166657 |
| Summary | [AMDGPU] register spill instructions are generated inside control flow with `exec=0` |
| Labels | new issue |
| Assignees | |
| Reporter | raiseirql |

Register spills generated by the backend can be scheduled in regions of code where `exec=0`, so the instructions are not executed. Kernels with these spills then crash or produce incorrect results.
Apologies for the long repro case: this is an attention kernel generated by the Mojo compiler. I tried to create a simplified repro case but could not hit the condition.
In the repro, the kernel uses `readfirstlane` for the scalar buffer resource. The recently added `amdgpu-uniform-intrinsic-combine` pass correctly determines that these values are uniform and removes the `readfirstlane` intrinsics. However, due to later instruction scheduling, `si-fix-sgpr-copies` generates code that assumes the values are not uniform, and a loop like this is generated (I'm opening a separate issue for the `readfirstlane` problem):
```assembly
s_cmp_lg_u32 s55, 0
s_mov_b64 exec, s[38:39]
s_cselect_b32 s55, 1, 0
s_mov_b64 s[38:39], exec
.LBB0_17: ; Parent Loop BB0_7 Depth=1
; => This Inner Loop Header: Depth=2
v_readfirstlane_b32 s4, v0
v_readfirstlane_b32 s5, v1
v_readfirstlane_b32 s6, v254
v_readfirstlane_b32 s7, v255
v_cmp_eq_u64_e32 vcc, s[4:5], v[0:1]
s_nop 0
v_cmp_eq_u64_e64 s[2:3], s[6:7], v[254:255]
s_and_b64 s[2:3], vcc, s[2:3]
s_and_saveexec_b64 s[2:3], s[2:3]
buffer_load_dwordx4 v[2:5], v26, s[4:7], s48 offen
s_xor_b64 exec, exec, s[2:3]
s_cbranch_execnz .LBB0_17
; %bb.18: ; in Loop: Header=BB0_7 Depth=1
v_accvgpr_write_b32 a97, v13
v_accvgpr_write_b32 a88, v10
s_cmp_lg_u32 s55, 0
s_mov_b64 exec, s[38:39]
s_cselect_b32 s55, 1, 0
s_mov_b64 s[38:39], exec
```
The problem here is that the `v_accvgpr_write_b32` instructions are in a section of code where the `exec` register is now zero, so these instructions are masked off. If I edit the final assembly to move these instructions after the `s_mov_b64 exec, ...` that restores the `exec` mask, then the kernel runs fine. This occurs in multiple places in this example. I have also observed cases on MI300 where scratch instructions are generated in this region.
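To make the masking concrete, here is a minimal Python model of the loop at `.LBB0_17` and of a lane-masked vector write. It is only a sketch under stated assumptions, not the hardware or LLVM semantics: the 4-lane wave, the function names `waterfall_loop` and `masked_vector_write`, and the example descriptor values are all hypothetical and chosen for illustration.

```python
def waterfall_loop(exec_mask, rsrc_per_lane):
    """Simplified model of the .LBB0_17 loop.

    Returns (saved_exec, exec_after_loop, iterations).
    """
    saved_exec = exec_mask  # s_mov_b64 s[38:39], exec
    iterations = 0
    while exec_mask != 0:  # s_cbranch_execnz .LBB0_17
        # v_readfirstlane_b32: take the value from the lowest active lane.
        first_lane = (exec_mask & -exec_mask).bit_length() - 1
        uniform = rsrc_per_lane[first_lane]
        # v_cmp_eq + s_and_b64: lanes whose value matches the uniform one.
        # Inactive lanes contribute 0 to the compare result.
        match = 0
        for lane, value in enumerate(rsrc_per_lane):
            if value == uniform and (exec_mask >> lane) & 1:
                match |= 1 << lane
        # s_and_saveexec_b64 s[2:3], s[2:3]: save exec, then exec &= match.
        old_exec = exec_mask
        exec_mask = old_exec & match
        # (buffer_load_dwordx4 would run here for the lanes in exec_mask.)
        iterations += 1
        # s_xor_b64 exec, exec, s[2:3]: keep only the lanes not yet handled.
        exec_mask ^= old_exec
    return saved_exec, exec_mask, iterations


def masked_vector_write(dst, src, exec_mask):
    """Model of a VALU write such as v_accvgpr_write_b32: only active lanes write."""
    for lane in range(len(dst)):
        if (exec_mask >> lane) & 1:
            dst[lane] = src[lane]
    return dst


# Hypothetical 4-lane wave where lane 2 holds a different descriptor value.
saved, exec_after, iters = waterfall_loop(0b1111, [7, 7, 9, 7])

# Spill placed right after the loop, before exec is restored: nothing is written.
lost = masked_vector_write([0, 0, 0, 0], [1, 2, 3, 4], exec_after)

# Same spill placed after `s_mov_b64 exec, s[38:39]`: all lanes are written.
kept = masked_vector_write([0, 0, 0, 0], [1, 2, 3, 4], saved)
```

In this model the loop can only exit with `exec == 0`, so any vector instruction placed between the loop exit and the `exec` restore is a no-op. That matches what I see in the generated code.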
The excessive register usage here is due to this being an attention op for `head_size=256`. The kernel itself is not touching the `exec` register behind the back of the compiler. The expectation would be that the above transforms would generate slow but functional code.