Issue | 141931
Summary | AMDGPU should not scalarize v2f16 / v2bf16 copysign
Labels | backend:AMDGPU, missed-optimization
Assignees |
Reporter | arsenm

Currently, copysign on 2-element half vectors is scalarized and produces this ugly expansion:
```
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s
; s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; s_movk_i32 s4, 0x7fff
; v_bfi_b32 v2, s4, v0, v1
; v_lshrrev_b32_e32 v1, 16, v1
; v_lshrrev_b32_e32 v0, 16, v0
; v_bfi_b32 v0, s4, v0, v1
; s_mov_b32 s4, 0x5040100
; v_perm_b32 v0, v0, v2, s4
; s_setpc_b64 s[30:31]
define <2 x half> @copysign_v2f16(<2 x half> %a, <2 x half> %b) {
  %result = call <2 x half> @llvm.copysign.v2f16(<2 x half> %a, <2 x half> %b)
  ret <2 x half> %result
}
```
If I hack up the vector legalizer's logic, the default expansion finds a vector BFI:
With gfx803:
```
s_mov_b32 s4, 0x7fff7fff
v_bfi_b32 v0, s4, v0, v1
```
With gfx9+, it does worse:
```
v_and_b32_e32 v1, 0x80008000, v1
s_mov_b32 s4, 0x7fff7fff
v_and_or_b32 v0, v0, s4, v1
```
We can trivially extend the existing legal f16 copysign pattern to handle the 2-element case, as in the gfx8 output. Supporting the cases where the sign source is a different FP type takes a little more work.
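For reference, here is a rough bit-level model of why a single BFI with the `0x7fff7fff` mask is enough for the packed case, and why the gfx9 `v_and_b32` + `v_and_or_b32` sequence computes the same value with one extra instruction. This is only a sketch in Python, not backend code: the helper names (`bfi`, `and_or`, `pack_v2f16`, `unpack_v2f16`) are made up for illustration, and the instruction semantics are assumed to be `bfi(m, a, b) = (a & m) | (b & ~m)` and `and_or(a, b, c) = (a & b) | c`.

```python
import struct

MAG_MASK = 0x7FFF7FFF   # magnitude bits of both packed f16 lanes
SIGN_MASK = 0x80008000  # sign bits of both packed f16 lanes


def bfi(mask, a, b):
    """Assumed v_bfi_b32 semantics: bits of a where mask is 1, bits of b elsewhere."""
    return ((a & mask) | (b & ~mask)) & 0xFFFFFFFF


def and_or(a, b, c):
    """Assumed v_and_or_b32 semantics: (a & b) | c."""
    return ((a & b) | c) & 0xFFFFFFFF


def pack_v2f16(lo, hi):
    """Pack two Python floats into one 32-bit value as <2 x half> (lane 0 in the low bits)."""
    lo_bits, = struct.unpack("<H", struct.pack("<e", lo))
    hi_bits, = struct.unpack("<H", struct.pack("<e", hi))
    return lo_bits | (hi_bits << 16)


def unpack_v2f16(bits):
    """Inverse of pack_v2f16."""
    lo, = struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))
    hi, = struct.unpack("<e", struct.pack("<H", (bits >> 16) & 0xFFFF))
    return lo, hi


def copysign_v2f16_bfi(a, b):
    """Shape of the gfx803 output: one mask materialization plus one BFI."""
    return bfi(MAG_MASK, a, b)


def copysign_v2f16_and_or(a, b):
    """Shape of the gfx9+ output: an extra AND to isolate the sign bits first."""
    return and_or(a, MAG_MASK, b & SIGN_MASK)


if __name__ == "__main__":
    a = pack_v2f16(1.5, -2.0)
    b = pack_v2f16(-0.0, 3.0)
    print(unpack_v2f16(copysign_v2f16_bfi(a, b)))
    print(unpack_v2f16(copysign_v2f16_and_or(a, b)))
```

Both functions operate on the whole 32-bit register at once, so no per-lane shift or `v_perm_b32` repacking is needed. The same masks would apply to v2bf16, since bf16 also keeps its sign in bit 15 of each 16-bit lane.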