| Issue |
168761
|
| Summary |
[AMDGPU] regression from AV class change in `SITargetLowering::getRegClassFor`
|
| Labels |
new issue
|
| Assignees |
|
| Reporter |
raiseirql
|
We (Modular) picked up LLVM commit `4d42a0c3f139f41fb7409e7831f21ab9bca40a0c`. This includes the changes from https://github.com/llvm/llvm-project/pull/166483. There is a regression in the code generation for one of the routines in our GPU `print()` path.
See https://gist.github.com/raiseirql/4b1815844ee452f68f7b2c4dd8625feb for a reduced repro; the relevant block is the one bracketed by `llvm.debugtrap`. This code is generated from https://github.com/modular/modular/blob/e8cce25027e913cdb54baf2aedc1789a11aa5301/mojo/stdlib/stdlib/builtin/_format_float.mojo#L174. Each lane prints a float value, and this code counts how many characters are needed. The counts are expected to diverge across lanes based on the per-lane float value. With the regression, the printed output is corrupted.
If we change `SITargetLowering::getRegClassFor` so that it no longer returns an AV class, the correct code is produced:
```
if (TRI->isSGPRClass(RC) && isDivergent) {
  // Disable the new code to fix codegen.
#if 0
  if (Subtarget->hasGFX90AInsts())
    return TRI->getEquivalentAVClass(RC);
#endif
  return TRI->getEquivalentVGPRClass(RC);
}
```
The working codegen:
```
s_add_u32 s4, s4, 1
v_bitop3_b16 v4, v5, v8, s10 bitop3:0xec
v_lshlrev_b16_e32 v3, 8, v3
s_addc_u32 s5, s5, 0
v_lshlrev_b32_e32 v4, 16, v4
v_bitop3_b16 v3, v7, v3, s10 bitop3:0xec
s_or_b64 s[6:7], vcc, s[6:7]
v_mov_b64_e32 v[68:69], s[4:5] ; <<<<< this captures the loop index that diverges across threads
v_or_b32_sdwa v66, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
s_andn2_b64 exec, exec, s[6:7]
s_cbranch_execnz .LBB5_30
s_or_b64 exec, exec, s[6:7]
.LBB5_32:
s_or_b64 exec, exec, s[2:3]
s_mov_b64 s[22:23], 0
v_cmp_lt_i64_e32 vcc, 0, v[68:69]
s_mov_b64 s[0:1], -1
s_mov_b64 s[26:27], 0
s_trap 3
```
The broken codegen:
```
s_add_u32 s4, s4, 1
v_bitop3_b16 v4, v5, v8, s10 bitop3:0xec
v_lshlrev_b16_e32 v3, 8, v3
s_addc_u32 s5, s5, 0
v_lshlrev_b32_e32 v4, 16, v4
v_bitop3_b16 v3, v7, v3, s10 bitop3:0xec
s_or_b64 s[6:7], vcc, s[6:7]
v_or_b32_sdwa v68, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
s_andn2_b64 exec, exec, s[6:7]
s_cbranch_execnz .LBB5_30
; %bb.31: ; %Flow35
s_or_b64 exec, exec, s[6:7]
v_mov_b64_e32 v[70:71], s[4:5] ; <<<<< this move should be inside the loop above; here it effectively captures max(idx)
.LBB5_32: ; %Flow36
s_or_b64 exec, exec, s[2:3]
s_mov_b64 s[22:23], 0
v_cmp_lt_i64_e32 vcc, 0, v[70:71]
s_mov_b64 s[0:1], -1
s_mov_b64 s[26:27], 0
s_trap 3
```
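To make the difference between the two listings concrete, here is a minimal host-side sketch of the loop, assuming a 4-lane wave and a do-while loop shape. Everything in it (`Wave`, `kLanes`, `runLoop`, `copyInsideLoop`, `copyAfterLoop`) is a hypothetical name chosen for illustration and is not part of LLVM or the repro. The counter stands in for `s[4:5]`, the `exec` array for the exec mask, and the result array for the VGPR pair that receives the `v_mov_b64`.

```
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a small wave; real waves have 32 or 64 lanes.
constexpr std::size_t kLanes = 4;
using Wave = std::array<std::int64_t, kLanes>;

// Runs a divergent do-while loop in which lane i exits after tripCount[i]
// iterations. If copyInside is true, the uniform counter is copied to the
// per-lane register inside the loop (working codegen); otherwise it is
// copied once after the loop, when all lanes are re-enabled (broken codegen).
static Wave runLoop(const Wave &tripCount, bool copyInside) {
  Wave vgpr{};
  std::array<bool, kLanes> exec;
  exec.fill(true);
  std::int64_t sgprCounter = 0; // models s[4:5]
  bool anyActive = true;
  while (anyActive) {
    ++sgprCounter; // models s_add_u32 / s_addc_u32
    anyActive = false;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      assert(tripCount[lane] >= 1 && "do-while body runs at least once");
      if (!exec[lane])
        continue; // lane already left the loop
      if (copyInside)
        vgpr[lane] = sgprCounter; // v_mov under the current exec mask
      if (sgprCounter >= tripCount[lane])
        exec[lane] = false; // models s_andn2_b64 exec, exec, ...
      anyActive = anyActive || exec[lane];
    }
  }
  if (!copyInside) {
    // After the loop every lane is active again, so every lane receives
    // the final counter value, which is the maximum trip count.
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      vgpr[lane] = sgprCounter;
  }
  return vgpr;
}

Wave copyInsideLoop(const Wave &tripCount) { return runLoop(tripCount, true); }
Wave copyAfterLoop(const Wave &tripCount) { return runLoop(tripCount, false); }
```

In this model, `copyInsideLoop` returns each lane's own trip count, while `copyAfterLoop` returns the largest trip count in every lane, which matches the `max(idx)` behavior noted in the broken listing.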