Lunderberg commented on PR #17283: URL: https://github.com/apache/tvm/pull/17283#issuecomment-2305055694
I think I've tracked down the problem. Writing down the steps to record it, and to collect all the links in one spot. 1. The `__truncsfhf2` and `__extendhfsf2` are builtins provided by LLVM. Calls to these builtins are generated from the LLVM IR instructions `fptrunc` and `fpext`. 2. In LLVM 15, the ABI of these builtins was changed from accepting `uint16` arguments to accepting `_Float16` arguments (see [this thread](https://github.com/llvm/llvm-project/issues/56854). As a result, TVM-generated code that was compiled under LLVM 15 would be incompatible with the LLVM 14 runtime. 3. As a result of (2), TVM [PR#12877](https://github.com/apache/tvm/pull/12877) would inject local overrides of `__truncsfhf2` and `__extendhfsf2`. That way, when LLVM lowers `fpext` to `__extendhfsf2`, it uses our local override of `__extendhfsf2`. The choice of which `__extendhfsf2` is based on whether the target supports SSE2, matching the decision made by LLVM. 4. After ([this commit](https://github.com/llvm/llvm-project/commit/489bda6a9c0ec9d2644b7bb0c230294d38f7296e)), LLVM performs a per-architecture check to determine whether the compiler supports the use of `_Float16` use of float16. The first release containing the commit was LLVM 17. However, this commit also switches from using the LLVM `builtin_check_c_compiler_source` to the cmake `check_c_source_compiles`. The former only attempts to compile a string ([source link](https://github.com/llvm/llvm-project/blob/main/compiler-rt/cmake/Modules/BuiltinTests.cmake#L3)), while the latter attempts to compile and link the string ([doc link](https://cmake.org/cmake/help/latest/module/CheckCSourceCompiles.html)). If I understand correctly, since the string doesn't define `int main`, this check would always return false. This looks like a bug in the upstream LLVM implementation, but the relevance here is that it would mean that LLVM 17 would produce calls to the `uint32` ABI, while our replacement function would expect calls with the `_Float16` ABI. 5. In June, the Windows CI runners updated from using LLVM 16 to LLVM 18 ([link](https://github.com/actions/runner-images/issues/10001)). The timing of it is a point against this hypothesis, since the ~2 month gap between when the new version of LLVM was rolled out in Github and when the issue started occurring in TVM. It's possible that that was just their gradual rollout, but that would be a stretch. To test whether there's an incompatibility between the ABI expected by LLVM, and the ABI that we provide, I've added more debug statements and commented out the call to `EmitFloat16ConversionBuiltins`. If commenting out that line avoids the issue on Windows, then that would at least point to some sort of incompatibility. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
