This is an automated email from the ASF dual-hosted git repository.

cyx-6 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git


The following commit(s) were added to refs/heads/main by this push:
     new 230ae6c8 Include torch build/ABI in torch C-DLPack addon cache key (#644)
230ae6c8 is described below

commit 230ae6c86ababb170baac6c485214e0a5af30702
Author: Piotr Mazurek <[email protected]>
AuthorDate: Mon Jun 29 17:39:03 2026 +0200

    Include torch build/ABI in torch C-DLPack addon cache key (#644)
    
    ## Problem
    
    The prebuilt torch C-DLPack addon is cached under a filename derived
    only from torch **major.minor** + a coarse device string:
    
    ```python
    major, minor = torch.__version__.split(".")[:2]
    device = _torch_extension_device(torch)          # "cuda" / "rocm" / "cpu"
    libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"
    lib_path = cache_dir / libname                   # cache_dir defaults to ~/.cache/tvm-ffi
    if not lib_path.exists():
        ...build...                                  # otherwise reuse whatever is there
    ```
    
    This key omits:
    - the torch **patch** version (`2.9.0` vs `2.9.1`),
    - the build local-version tag carried in `torch.__version__` (`+cu121`
    vs `+cu124`, `+cpu`, …),
    - the C++ ABI flag (`torch._C._GLIBCXX_USE_CXX11_ABI`).
    
    Since the addon is a compiled extension linking libtorch's C++ ABI, two
    torch installs that share `major.minor` + device but differ in patch /
    CUDA toolkit / ABI resolve to the **same** cached `.so`. The addon built
    against the first torch is then silently reused by the second — an ABI
    mismatch in the DLPack bridge that surfaces as crashes, memory faults,
    or **silently wrong tensor data**, not a clean error.
    
    This is easy to hit whenever `~/.cache/tvm-ffi` is shared across
    environments — a shared/NFS home, or container images that mount the
    host home and see the same cache under different torch builds.
    
    ## Reproduce
    
    ```bash
    # env A
    pip install torch==2.9.0+cu121 --index-url https://download.pytorch.org/whl/cu121
    python -c "import tvm_ffi"     # builds ~/.cache/tvm-ffi/libtorch_c_dlpack_addon_torch29-cuda.so
    
    # env B: same major.minor, different build/ABI, same cache dir
    pip install torch==2.9.1+cu124 --index-url https://download.pytorch.org/whl/cu124
    python -c "import tvm_ffi"     # REUSES the cu121-built torch29-cuda.so -> ABI mismatch
    ```
    The same applies to any two builds that share
    `torch{major}{minor}-{device}`, including CPU and ROCm builds.
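    
    The collision can be checked without installing torch. The sketch below
    mirrors the old naming on plain version strings; `old_cache_name` is a
    hypothetical helper written for illustration, not a function in
    `tvm_ffi`, and the device string is passed in by hand rather than
    detected.
    
    ```python
    def old_cache_name(version: str, device: str, suffix: str = ".so") -> str:
        """Mirror the pre-fix naming: only major.minor and device reach the key."""
        # Hypothetical helper for illustration; not part of tvm_ffi.
        major, minor = version.split(".")[:2]
        return f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"
    
    
    name_a = old_cache_name("2.9.0+cu121", "cuda")
    name_b = old_cache_name("2.9.1+cu124", "cuda")
    print(name_a)            # libtorch_c_dlpack_addon_torch29-cuda.so
    print(name_a == name_b)  # True: both builds map to one cached file
    ```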
    
    ## Fix
    
    Fold the full torch build identity (`torch.__version__`, which already
    carries patch + `+cuXXX`/`+rocmX.Y`/`+cpu`) and the C++ ABI flag into
    the cached addon name via a short hash. Incompatible builds now get
    distinct cache entries, while same-build reuse is unchanged. The build
    subprocess receives `libname` from this call site, so it stays
    consistent automatically.
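    
    As a rough sketch of the resulting scheme, the function below rebuilds
    the new name from plain inputs so it can be tried without torch.
    `addon_cache_name` and its parameters are assumptions made for
    illustration; the actual change computes the same pieces inline from
    `torch.__version__` and `torch.compiled_with_cxx11_abi()`.
    
    ```python
    import hashlib
    
    
    def addon_cache_name(
        version: str, cxx11_abi: bool, device: str, suffix: str = ".so"
    ) -> str:
        """Build a cache file name that includes a short hash of the build identity."""
        # Hypothetical helper for illustration; not part of tvm_ffi.
        major, minor = version.split(".")[:2]
        # The full version string (patch + local build tag) and the ABI flag
        # are hashed together; the first 8 hex digits keep the name short.
        abi_id = f"{version}|cxx11abi={int(cxx11_abi)}"
        abi_tag = hashlib.sha256(abi_id.encode()).hexdigest()[:8]
        return f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}-{abi_tag}{suffix}"
    
    
    name_cu121 = addon_cache_name("2.9.0+cu121", True, "cuda")
    name_cu124 = addon_cache_name("2.9.1+cu124", True, "cuda")
    print(name_cu121 != name_cu124)  # distinct builds get distinct cache entries
    print(name_cu121 == addon_cache_name("2.9.0+cu121", True, "cuda"))  # same build reuses its entry
    ```
    
    The `torch{major}{minor}-{device}` prefix stays human-readable, so
    entries in the cache directory can still be grouped by release at a
    glance; only the hashed tag separates otherwise identical-looking builds.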
    
    Signed-off-by: Piotr Mazurek <[email protected]>
    Co-authored-by: Piotr Mazurek <[email protected]>
---
 python/tvm_ffi/_optional_torch_c_dlpack.py | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/python/tvm_ffi/_optional_torch_c_dlpack.py b/python/tvm_ffi/_optional_torch_c_dlpack.py
index 5c804f14..b50cc8bf 100644
--- a/python/tvm_ffi/_optional_torch_c_dlpack.py
+++ b/python/tvm_ffi/_optional_torch_c_dlpack.py
@@ -33,6 +33,7 @@ subsequent calls will be much faster.
 from __future__ import annotations
 
 import ctypes
+import hashlib
 import logging
 import os
 import subprocess
@@ -134,7 +135,17 @@ def load_torch_c_dlpack_extension() -> Any:  # noqa: PLR0912, PLR0915
         major, minor = torch.__version__.split(".")[:2]
         device = _torch_extension_device(torch)
         suffix = ".dll" if sys.platform.startswith("win") else ".so"
-        libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"
+        # The addon is a compiled extension that links libtorch's C++ ABI, so its
+        # cache key must capture the full torch build identity -- not just
+        # major.minor + device. ``torch.__version__`` carries the patch version and
+        # build tag (e.g. "+cu124", "+rocm6.2", "+cpu"); we also fold in the C++ ABI
+        # flag. Without this, two ABI-incompatible torch builds that share
+        # major.minor + device resolve to the same cached ``.so``, and a shared cache
+        # directory (NFS home, reused container images) silently loads a mismatched
+        # addon -> crashes or wrong tensor data instead of a clean rebuild.
+        abi_id = f"{torch.__version__}|cxx11abi={int(torch.compiled_with_cxx11_abi())}"
+        abi_tag = hashlib.sha256(abi_id.encode()).hexdigest()[:8]
+        libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}-{abi_tag}{suffix}"
         lib_path = addon_output_dir / libname
         if not lib_path.exists():
             logger.debug("JIT-compiling torch-c-dlpack-ext to cache...")
