Lunderberg commented on PR #17283:
URL: https://github.com/apache/tvm/pull/17283#issuecomment-2305055694

   I think I've tracked down the problem.  Writing down the steps to record it, 
and to collect all the links in one spot.
   
   1. The `__truncsfhf2` and `__extendhfsf2` are builtins provided by LLVM.  
Calls to these builtins are generated from the LLVM IR instructions `fptrunc` 
and `fpext`.
   2. In LLVM 15, the ABI of these builtins was changed from accepting `uint16` 
arguments to accepting `_Float16` arguments (see [this 
thread](https://github.com/llvm/llvm-project/issues/56854).  As a result, 
TVM-generated code that was compiled under LLVM 15 would be incompatible with 
the LLVM 14 runtime.
   3. As a result of (2), TVM 
[PR#12877](https://github.com/apache/tvm/pull/12877) would inject local 
overrides of `__truncsfhf2` and `__extendhfsf2`.  That way, when LLVM lowers 
`fpext` to `__extendhfsf2`, it uses our local override of `__extendhfsf2`.  The 
choice of which `__extendhfsf2` is based on whether the target supports SSE2, 
matching the decision made by LLVM.
   4. After ([this 
commit](https://github.com/llvm/llvm-project/commit/489bda6a9c0ec9d2644b7bb0c230294d38f7296e)),
 LLVM performs a per-architecture check to determine whether the compiler 
supports the use of `_Float16` use of float16.  The first release containing 
the commit was LLVM 17.
   
       However, this commit also switches from using the LLVM 
`builtin_check_c_compiler_source` to the cmake `check_c_source_compiles`.  The 
former only attempts to compile a string ([source 
link](https://github.com/llvm/llvm-project/blob/main/compiler-rt/cmake/Modules/BuiltinTests.cmake#L3)),
 while the latter attempts to compile and link the string ([doc 
link](https://cmake.org/cmake/help/latest/module/CheckCSourceCompiles.html)).  
If I understand correctly, since the string doesn't define `int main`, this 
check would always return false.
       
       This looks like a bug in the upstream LLVM implementation, but the 
relevance here is that it would mean that LLVM 17 would produce calls to the 
`uint32` ABI, while our replacement function would expect calls with the 
`_Float16` ABI.
   
   5. In June, the Windows CI runners updated from using LLVM 16 to LLVM 18 
([link](https://github.com/actions/runner-images/issues/10001)).  The timing of 
it is a point against this hypothesis, since the ~2 month gap between when the 
new version of LLVM was rolled out in Github and when the issue started 
occurring in TVM.  It's possible that that was just their gradual rollout, but 
that would be a stretch.
   
   To test whether there's an incompatibility between the ABI expected by LLVM, 
and the ABI that we provide, I've added more debug statements and commented out 
the call to `EmitFloat16ConversionBuiltins`.  If commenting out that line 
avoids the issue on Windows, then that would at least point to some sort of 
incompatibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to