Issue 168761
Summary [AMDGPU] regression from AV class change in `SITargetLowering::getRegClassFor`
Labels new issue
Assignees
Reporter raiseirql
We (Modular) picked up LLVM commit `4d42a0c3f139f41fb7409e7831f21ab9bca40a0c`, which includes the changes from https://github.com/llvm/llvm-project/pull/166483. With that commit, code generation regresses for one of the routines in our GPU `print()` path.

See https://gist.github.com/raiseirql/4b1815844ee452f68f7b2c4dd8625feb for a reduced repro. Specifically, see the block that is bracketed by `llvm.debugtrap`. This code is generated from https://github.com/modular/modular/blob/e8cce25027e913cdb54baf2aedc1789a11aa5301/mojo/stdlib/stdlib/builtin/_format_float.mojo#L174. Each lane prints a float value, and this code counts how many characters are needed. The counts are expected to diverge across lanes, since they depend on each lane's float value. With the regression, the printed output is corrupted.

If we change `SITargetLowering::getRegClassFor` so that it no longer returns an AV class, the correct code is produced:
```
  if (TRI->isSGPRClass(RC) && isDivergent) {
// Disable the new code to fix codegen.
#if 0
    if (Subtarget->hasGFX90AInsts())
      return TRI->getEquivalentAVClass(RC);
#endif
    return TRI->getEquivalentVGPRClass(RC);
  }
``` 

The working codegen:
```
        s_add_u32 s4, s4, 1
        v_bitop3_b16 v4, v5, v8, s10 bitop3:0xec
        v_lshlrev_b16_e32 v3, 8, v3
        s_addc_u32 s5, s5, 0
        v_lshlrev_b32_e32 v4, 16, v4
        v_bitop3_b16 v3, v7, v3, s10 bitop3:0xec
        s_or_b64 s[6:7], vcc, s[6:7]
        v_mov_b64_e32 v[68:69], s[4:5] ; <<<<< this captures the loop index that diverges across threads
        v_or_b32_sdwa v66, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
        s_andn2_b64 exec, exec, s[6:7]
        s_cbranch_execnz .LBB5_30
        s_or_b64 exec, exec, s[6:7]
.LBB5_32:
        s_or_b64 exec, exec, s[2:3]
        s_mov_b64 s[22:23], 0
        v_cmp_lt_i64_e32 vcc, 0, v[68:69]
        s_mov_b64 s[0:1], -1
        s_mov_b64 s[26:27], 0
        s_trap 3
``` 

The broken codegen:
```
        s_add_u32 s4, s4, 1
        v_bitop3_b16 v4, v5, v8, s10 bitop3:0xec
        v_lshlrev_b16_e32 v3, 8, v3
        s_addc_u32 s5, s5, 0
        v_lshlrev_b32_e32 v4, 16, v4
        v_bitop3_b16 v3, v7, v3, s10 bitop3:0xec
        s_or_b64 s[6:7], vcc, s[6:7]
        v_or_b32_sdwa v68, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
        s_andn2_b64 exec, exec, s[6:7]
        s_cbranch_execnz .LBB5_30
; %bb.31:                               ; %Flow35
        s_or_b64 exec, exec, s[6:7]
        v_mov_b64_e32 v[70:71], s[4:5] ; <<<<< this move should be inside the above loop; here it effectively captures max(idx)
.LBB5_32:                               ; %Flow36
        s_or_b64 exec, exec, s[2:3]
        s_mov_b64 s[22:23], 0
        v_cmp_lt_i64_e32 vcc, 0, v[70:71]
        s_mov_b64 s[0:1], -1
        s_mov_b64 s[26:27], 0
        s_trap 3
``` 
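
To make the difference between the two listings concrete, here is a small host-side C++ sketch of the behavior as we understand it. It is only a model and not LLVM or hardware code: the 4-lane wave, the per-lane trip counts, and names such as `simulateDivergentLoop` are assumptions made up for illustration. It models one scalar counter shared by the wave (like `s[4:5]`), an exec mask that drops lanes as they leave the loop, and the two possible placements of the scalar-to-vector copy.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-lane "wave" for illustration; real waves are wider.
constexpr std::size_t kLanes = 4;
using LaneValues = std::array<std::int64_t, kLanes>;

struct LoopResult {
  LaneValues copiedInLoop;    // copy placed inside the loop (working codegen)
  LaneValues copiedAfterLoop; // copy placed after the loop (broken codegen)
};

// tripCounts[lane] is the number of iterations that lane runs (assumed >= 1,
// since the loop body executes before the exit test).
LoopResult simulateDivergentLoop(const LaneValues &tripCounts) {
  std::array<bool, kLanes> exec;
  exec.fill(true);            // all lanes start active
  LoopResult result{};        // zero-initialized per-lane values
  std::int64_t scalarIdx = 0; // wave-uniform counter, like s[4:5]

  bool anyActive = true;
  while (anyActive) {
    scalarIdx += 1;           // one scalar add per wave iteration
    anyActive = false;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      if (!exec[lane])
        continue;             // lanes that already exited are not written
      // In-loop copy: only lanes still active see the current index.
      result.copiedInLoop[lane] = scalarIdx;
      if (scalarIdx >= tripCounts[lane])
        exec[lane] = false;   // lane leaves the loop
      else
        anyActive = true;
    }
  }

  // After the loop the exec mask is restored, so a copy placed here writes
  // the final scalar value into every lane.
  result.copiedAfterLoop.fill(scalarIdx);
  return result;
}
```

In this model, calling `simulateDivergentLoop(LaneValues{1, 3, 2, 5})` leaves each lane's own exit index in `copiedInLoop`, while every lane in `copiedAfterLoop` holds the largest index, which matches the `max(idx)` behavior noted in the broken listing.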
