llvmorg-github-actions[bot] wrote:
<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

<details>
<summary>Changes</summary>

Summary:
This matches what the AMD device libraries do. We can avoid extra steps
by performing only the two Newton-Raphson steps of the approximation of
1 / x. The exceptional cases should not appear in these math functions,
and the change is local to AMDGPU. I verified that the results are
bitwise identical to the AMD math functions, and performance is now at
parity.

---
Full diff: https://github.com/llvm/llvm-project/pull/203809.diff

1 Files Affected:

- (modified) libclc/clc/lib/amdgpu/math/clc_recip_fast.inc (+8-1)

``````````diff
diff --git a/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc b/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
index 9d635cc700442..e19ec82d7566d 100644
--- a/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
+++ b/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
@@ -6,10 +6,17 @@
 //
 //===----------------------------------------------------------------------===//
 
-// On AMDGPU the "fast" reciprocal is the hardware v_rcp_f32 approximation,
+// On AMDGPU the "fast" reciprocal is the hardware v_rcp approximation.
 _CLC_OVERLOAD _CLC_DEF __CLC_GENTYPE __clc_recip_fast(__CLC_GENTYPE x) {
 #if defined(__CLC_SCALAR) && __CLC_FPSIZE == 32
   return __builtin_amdgcn_rcpf(x);
+#elif defined(__CLC_SCALAR) && __CLC_FPSIZE == 64
+  // Hardware v_rcp_f64 seed refined with two Newton-Raphson iterations. This
+  // computes 1.0 / x without the full IEEE scaling and subnormal fixups.
+  __CLC_GENTYPE r = __builtin_amdgcn_rcp(x);
+  r = __builtin_fma(__builtin_fma(-x, r, 1.0), r, r);
+  r = __builtin_fma(__builtin_fma(-x, r, 1.0), r, r);
+  return r;
 #else
   return ((__CLC_GENTYPE)1.0) / x;
 #endif

``````````

</details>

https://github.com/llvm/llvm-project/pull/203809
