This is an automated email from the ASF dual-hosted git repository.
cyx-6 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git
The following commit(s) were added to refs/heads/main by this push:
new 230ae6c8 Include torch build/ABI in torch C-DLPack addon cache key (#644)
230ae6c8 is described below
commit 230ae6c86ababb170baac6c485214e0a5af30702
Author: Piotr Mazurek <[email protected]>
AuthorDate: Mon Jun 29 17:39:03 2026 +0200
Include torch build/ABI in torch C-DLPack addon cache key (#644)
## Problem
The prebuilt torch C-DLPack addon is cached under a filename derived
only from torch **major.minor** + a coarse device string:
```python
major, minor = torch.__version__.split(".")[:2]
device = _torch_extension_device(torch)  # "cuda" / "rocm" / "cpu"
libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"
lib_path = cache_dir / libname  # cache_dir defaults to ~/.cache/tvm-ffi
if not lib_path.exists():
    ...build...  # otherwise reuse whatever is there
```
This key omits:
- the torch **patch** version (`2.9.0` vs `2.9.1`),
- the build local-version tag carried in `torch.__version__` (`+cu121`
vs `+cu124`, `+cpu`, …),
- the C++ ABI flag (`torch._C._GLIBCXX_USE_CXX11_ABI`).
Since the addon is a compiled extension linking libtorch's C++ ABI, two
torch installs that share `major.minor` + device but differ in patch /
CUDA toolkit / ABI resolve to the **same** cached `.so`. The addon built
against the first torch is then silently reused by the second — an ABI
mismatch in the DLPack bridge that surfaces as crashes, memory faults,
or **silently wrong tensor data**, not a clean error.
This is easy to hit whenever `~/.cache/tvm-ffi` is shared across
environments — a shared/NFS home, or container images that mount the
host home and see the same cache under different torch builds.
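
The collision described above can be illustrated without installing torch. The following is a minimal sketch, assuming a hypothetical helper `old_cache_name` that mirrors the pre-fix naming scheme and takes the version string and device as plain arguments instead of querying torch:

```python
def old_cache_name(torch_version: str, device: str, suffix: str = ".so") -> str:
    """Mirror the pre-fix naming scheme: only major.minor + device are used.

    Hypothetical helper for illustration; not part of tvm_ffi.
    """
    major, minor = torch_version.split(".")[:2]
    return f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"


# Two builds that differ in patch version and CUDA toolkit tag.
name_a = old_cache_name("2.9.0+cu121", "cuda")
name_b = old_cache_name("2.9.1+cu124", "cuda")

print(name_a)            # libtorch_c_dlpack_addon_torch29-cuda.so
print(name_a == name_b)  # True: both builds map to the same cached file
```

Because everything after `major.minor` is discarded by the `[:2]` slice, neither the patch version nor the local-version tag can influence the filename.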
## Reproduce
```bash
# env A
pip install torch==2.9.0+cu121 --index-url https://download.pytorch.org/whl/cu121
python -c "import tvm_ffi"  # builds ~/.cache/tvm-ffi/libtorch_c_dlpack_addon_torch29-cuda.so
# env B: same major.minor, different build/ABI, same cache dir
pip install torch==2.9.1+cu124 --index-url https://download.pytorch.org/whl/cu124
python -c "import tvm_ffi"  # REUSES the cu121-built torch29-cuda.so -> ABI mismatch
```
The same applies to any two builds that share
`torch{major}{minor}-{device}`, including CPU and ROCm builds.
## Fix
Fold the full torch build identity (`torch.__version__`, which already
carries patch + `+cuXXX`/`+rocmX.Y`/`+cpu`) and the C++ ABI flag into
the cached addon name via a short hash. Incompatible builds now get
distinct cache entries, while same-build reuse is unchanged. The build
subprocess receives `libname` from this call site, so it stays
consistent automatically.
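
The effect of the new key can be checked in isolation. This is a sketch under stated assumptions: `new_cache_name` is a hypothetical standalone helper that follows the hashing steps in the diff below, with the version string and ABI flag passed in as plain values rather than read from an installed torch:

```python
import hashlib


def new_cache_name(
    torch_version: str, cxx11_abi: bool, device: str, suffix: str = ".so"
) -> str:
    """Build a cache filename that includes a short hash of the build identity.

    Hypothetical helper for illustration; in the real module the inputs come
    from torch.__version__ and torch.compiled_with_cxx11_abi().
    """
    major, minor = torch_version.split(".")[:2]
    abi_id = f"{torch_version}|cxx11abi={int(cxx11_abi)}"
    abi_tag = hashlib.sha256(abi_id.encode()).hexdigest()[:8]
    return f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}-{abi_tag}{suffix}"


a = new_cache_name("2.9.0+cu121", True, "cuda")
b = new_cache_name("2.9.1+cu124", True, "cuda")
c = new_cache_name("2.9.0+cu121", False, "cuda")

print(a != b)  # different patch / CUDA tag -> different cache entry
print(a != c)  # same version, different C++ ABI flag -> different cache entry
print(a == new_cache_name("2.9.0+cu121", True, "cuda"))  # same build -> reuse
```

The readable `torch{major}{minor}-{device}` prefix is kept, so cache entries remain easy to identify by eye, while the 8-hex-character tag separates builds that the prefix alone cannot distinguish.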
Signed-off-by: Piotr Mazurek <[email protected]>
Co-authored-by: Piotr Mazurek <[email protected]>
---
python/tvm_ffi/_optional_torch_c_dlpack.py | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/python/tvm_ffi/_optional_torch_c_dlpack.py b/python/tvm_ffi/_optional_torch_c_dlpack.py
index 5c804f14..b50cc8bf 100644
--- a/python/tvm_ffi/_optional_torch_c_dlpack.py
+++ b/python/tvm_ffi/_optional_torch_c_dlpack.py
@@ -33,6 +33,7 @@ subsequent calls will be much faster.
from __future__ import annotations
import ctypes
+import hashlib
import logging
import os
import subprocess
@@ -134,7 +135,17 @@ def load_torch_c_dlpack_extension() -> Any: # noqa: PLR0912, PLR0915
major, minor = torch.__version__.split(".")[:2]
device = _torch_extension_device(torch)
suffix = ".dll" if sys.platform.startswith("win") else ".so"
- libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}{suffix}"
+ # The addon is a compiled extension that links libtorch's C++ ABI, so its
+ # cache key must capture the full torch build identity -- not just
+ # major.minor + device. ``torch.__version__`` carries the patch version and
+ # build tag (e.g. "+cu124", "+rocm6.2", "+cpu"); we also fold in the C++ ABI
+ # flag. Without this, two ABI-incompatible torch builds that share
+ # major.minor + device resolve to the same cached ``.so``, and a shared cache
+ # directory (NFS home, reused container images) silently loads a mismatched
+ # addon -> crashes or wrong tensor data instead of a clean rebuild.
+ abi_id = f"{torch.__version__}|cxx11abi={int(torch.compiled_with_cxx11_abi())}"
+ abi_tag = hashlib.sha256(abi_id.encode()).hexdigest()[:8]
+ libname = f"libtorch_c_dlpack_addon_torch{major}{minor}-{device}-{abi_tag}{suffix}"
lib_path = addon_output_dir / libname
if not lib_path.exists():
logger.debug("JIT-compiling torch-c-dlpack-ext to cache...")