michaelselehov wrote:

Reverts #185083.

This change causes a numerical-correctness regression on AMDGPU: HIP FP16 
wavefront reductions (`__reduce_{min,max}_sync<__half>`) return wrong values on 
gfx90a (verified on MI210). The PR author has explicitly given the green light 
to revert while they investigate further.

### Root cause

HIP declares `__half` as a doubly-nested aggregate 
(`hip/amd_detail/amd_hip_fp16.h`):

```
%struct.__half = type { %union.anon }
%union.anon    = type { half }
```

With this PR in place, clang's AMDGPU ABI lowering passes the HIP `__half` via 
`ABIArgInfo::getDirect()` as an aggregate. The kernel's call into the 
device-libs wave-reduction helper comes out as:

```llvm
%fca = insertvalue %union.anon poison, half %x, 0
%c   = tail call %struct.__half @__ockl_wfred_min_f16(%union.anon %fca) #9
```

The body of `__ockl_wfred_min_f16` comes from pre-built `rocm-device-libs` 
bitcode, whose source uses a plain `half` parameter:

```llvm
define internal half @__ockl_wfred_min_f16(half noundef %0)
```

Call-site argument type (`%union.anon`) does not match the callee formal type 
(`half`), and likewise for the return type. LLVM's `InlineFunction()` 
conservatively refuses to inline calls with mismatched argument SSA types, so 
this `convergent` wave-reduction helper stays out-of-line.

This matters because the kernel calls the helper from inside a divergent region 
(the `__reduce_*_sync` lane-mask `if`). AMDGPU lowers a call inside divergent 
control by saving EXEC and entering the callee with EXEC = (lanes that took the 
branch). The `update.dpp` / `ds_bpermute` intrinsics in the helper are 
`convergent` and physically read from neighbouring lanes' VGPRs -- but those 
neighbours never entered the callee, so the DPP picks up stale data left in 
`v0` from before the call (an unrelated input, a previous result, whatever 
happened to be there). When the helper *is* inlined, the same intrinsics run 
under the kernel's full-wave EXEC and the inactive lanes' inputs are explicitly 
replaced with identity values (`0x7C00` / `0xFC00`) before the DPP, so the 
reduction is correct. The numerical failure pattern matches exactly: `min` 
collapses toward 0 (stale `0x0000` wins), `max` picks up an unrelated neighbour.

Coordinated rebuild of `rocm-device-libs` does not fix this. The device-libs 
source uses plain `half`, not HIP's `__half`, so its IR signature stays 
`half(half)` regardless of which clang builds it. The mismatch is structural -- 
between the HIP header type and the device-libs source type -- and was 
previously masked because pre-#185083 the aggregate got coerced to a 
scalar-shaped value that the inliner could trivially bitcast through.

### Verification

Reproduced with a single-translation-unit HIP program at `-O2` 
(`__reduce_{min,max}_sync<__half>` over a small random buffer, then a host-side 
bitwise compare): ~48k FP16 mismatches with original PR in place, 0 mismatches 
with this revert applied. 

https://github.com/llvm/llvm-project/pull/199981
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to