This is an automated email from the ASF dual-hosted git repository.
tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git
The following commit(s) were added to refs/heads/main by this push:
new 1fb25db [DOCS] Polish quickstart (#168)
1fb25db is described below
commit 1fb25dbaf173b53ca7f52af2c10adac827923890
Author: Tianqi Chen <[email protected]>
AuthorDate: Fri Oct 17 21:09:27 2025 -0400
[DOCS] Polish quickstart (#168)
Updated examples of the C++ bundle, fixed the CMake input.
---
docs/get_started/quickstart.rst | 226 ++++++++++++++++++++++++++----------
examples/quick_start/CMakeLists.txt | 8 +-
2 files changed, 169 insertions(+), 65 deletions(-)
diff --git a/docs/get_started/quickstart.rst b/docs/get_started/quickstart.rst
index 8a03328..3eb02d1 100644
--- a/docs/get_started/quickstart.rst
+++ b/docs/get_started/quickstart.rst
@@ -20,12 +20,12 @@ Quick Start
This guide walks through shipping a minimal ``add_one`` function that computes
``y = x + 1`` in C++ and CUDA.
-
-TVM-FFI's Open ABI and FFI makes it possible to **build once, ship everywhere**. That said,
-a single shared library works across:
+TVM-FFI's Open ABI and FFI make it possible to **ship one library** for multiple frameworks and languages.
+We can build a single shared library that works across:
- **ML frameworks**, e.g. PyTorch, JAX, NumPy, CuPy, etc., and
- **languages**, e.g. C++, Python, Rust, etc.
+- **Python ABI versions**, e.g. ship one wheel to support multiple Python versions, including free-threaded Python.
.. admonition:: Prerequisite
:class: hint
@@ -34,7 +34,7 @@ a single shared library works across:
- Python: 3.9 or newer
- Compiler: C++17-capable toolchain (GCC/Clang/MSVC)
- Optional ML frameworks for testing: NumPy, PyTorch, JAX, CuPy
- - CUDA: Any modern version if you want to try the CUDA part
+ - CUDA: Any modern version (if you want to try the CUDA part)
- TVM-FFI installed via
.. code-block:: bash
@@ -59,7 +59,7 @@ Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x
.. code-block:: cpp
:emphasize-lines: 8, 17
- // File: main.cc
+ // File: add_one_cpu.cc
#include <tvm/ffi/container/tensor.h>
#include <tvm/ffi/function.h>
@@ -75,7 +75,7 @@ Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x
}
}
- TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cpp::AddOne);
+ TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cpu, tvm_ffi_example_cpp::AddOne);
}
@@ -109,28 +109,28 @@ Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x
AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
}
- TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cuda::AddOne);
+    TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cuda, tvm_ffi_example_cuda::AddOne);
}
-Macro :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` exports the C++ function ``AddOne`` with public name ``add_one`` in the resulting library.
-TVM-FFI looks it up at runtime to make the function available across languages.
+The macro :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` exports the C++ function ``AddOne``
+as a TVM FFI compatible symbol with the name ``add_one_cpu`` or ``add_one_cuda`` in the resulting library.
-Class :cpp:class:`tvm::ffi::TensorView` allows zero-copy interop with tensors from different ML frameworks:
+The class :cpp:class:`tvm::ffi::TensorView` allows zero-copy interop with tensors from different ML frameworks:
- NumPy, CuPy,
- PyTorch, JAX, or
- any array type that supports the standard `DLPack protocol <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
-Finally, :cpp:func:`TVMFFIEnvGetStream` used in CUDA code makes it possible to launch a kernel on caller's stream.
+Finally, :cpp:func:`TVMFFIEnvGetStream` can be used in the CUDA code to launch a kernel on the caller's stream.
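
As a rough illustration, the launch site inside the CUDA ``AddOne`` can look like the sketch below.
This is a sketch only: it reuses ``x_data``, ``y_data``, ``n``, ``blocks``, and ``threads`` from the snippet above,
and the exact :cpp:func:`TVMFFIEnvGetStream` signature (device type plus device id) as well as the ``device_id``
value are assumptions to be checked against the API reference.

.. code-block:: cpp

   // Sketch: query the stream the calling framework (e.g. PyTorch) has made
   // current for the tensor's device, then launch the kernel on it.
   // Assumes TVMFFIEnvGetStream(device_type, device_id); kDLCUDA is the
   // DLPack device type for CUDA GPUs.
   cudaStream_t stream = static_cast<cudaStream_t>(
       TVMFFIEnvGetStream(kDLCUDA, /*device_id=*/0));  // in practice, use the input tensor's device id
   // then: AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);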
.. _sec-cpp-compile-with-tvm-ffi:
Compile with TVM-FFI
~~~~~~~~~~~~~~~~~~~~
-**Raw command.** Basic command to compile the source code can be as concise as below:
+**Raw command.** We can use the following minimal commands to compile the source code:
.. tabs::
@@ -138,28 +138,29 @@ Compile with TVM-FFI
.. code-block:: bash
- g++ -shared -O3 main.cc \
+ g++ -shared -O3 add_one_cpu.cc \
-fPIC -fvisibility=hidden \
`tvm-ffi-config --cxxflags` \
`tvm-ffi-config --ldflags` \
`tvm-ffi-config --libs` \
- -o libmain.so
+ -o add_one_cpu.so
.. group-tab:: CUDA
.. code-block:: bash
- nvcc -shared -O3 main.cu \
+ nvcc -shared -O3 add_one_cuda.cu \
--compiler-options -fPIC \
--compiler-options -fvisibility=hidden \
`tvm-ffi-config --cxxflags` \
`tvm-ffi-config --ldflags` \
`tvm-ffi-config --libs` \
- -o libmain.so
+ -o add_one_cuda.so
-This produces a shared library ``libmain.so``. TVM-FFI automatically embeds the metadata needed to call the function across language and framework boundaries.
+This step produces a shared library (``add_one_cpu.so`` or ``add_one_cuda.so``) that can be used across languages and frameworks.
-**CMake.** As the preferred approach to build across platforms, CMake relies on CMake package ``tvm_ffi``, which can be found via ``tvm-ffi-config --cmakedir``.
+**CMake.** As the preferred approach for building across platforms,
+CMake relies on the CMake package ``tvm_ffi``, which can be found via ``tvm-ffi-config --cmakedir``.
.. tabs::
@@ -167,88 +168,121 @@ This produces a shared library ``libmain.so``. TVM-FFI
automatically embeds the
.. code-block:: cmake
- # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
find_package(Python COMPONENTS Interpreter REQUIRED)
+ # Run `tvm_ffi.config --cmakedir` to find tvm-ffi targets
execute_process(
- COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+ COMMAND "${Python_EXECUTABLE}" -m tvm_ffi.config --cmakedir
OUTPUT_STRIP_TRAILING_WHITESPACE
OUTPUT_VARIABLE tvm_ffi_ROOT
)
find_package(tvm_ffi CONFIG REQUIRED)
- # Create C++ target `add_one_cpp`
- add_library(add_one_cpp SHARED main.cc)
- target_link_libraries(add_one_cpp PRIVATE tvm_ffi_header)
- target_link_libraries(add_one_cpp PRIVATE tvm_ffi_shared)
+ # Create C++ target `add_one_cpu`
+ add_library(add_one_cpu SHARED add_one_cpu.cc)
+ target_link_libraries(add_one_cpu PRIVATE tvm_ffi_header)
+ target_link_libraries(add_one_cpu PRIVATE tvm_ffi_shared)
+ # show as add_one_cpu.so
+ set_target_properties(add_one_cpu PROPERTIES PREFIX "" SUFFIX ".so")
.. group-tab:: CUDA
.. code-block:: cmake
- # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
find_package(Python COMPONENTS Interpreter REQUIRED)
+ # Run `tvm_ffi.config --cmakedir` to find tvm-ffi targets
execute_process(
- COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+ COMMAND "${Python_EXECUTABLE}" -m tvm_ffi.config --cmakedir
OUTPUT_STRIP_TRAILING_WHITESPACE
OUTPUT_VARIABLE tvm_ffi_ROOT
)
find_package(tvm_ffi CONFIG REQUIRED)
# Create C++ target `add_one_cuda`
enable_language(CUDA)
- add_library(add_one_cuda SHARED main.cu)
+ add_library(add_one_cuda SHARED add_one_cuda.cu)
target_link_libraries(add_one_cuda PRIVATE tvm_ffi_header)
target_link_libraries(add_one_cuda PRIVATE tvm_ffi_shared)
+ # show as add_one_cuda.so
+ set_target_properties(add_one_cuda PROPERTIES PREFIX "" SUFFIX ".so")
.. hint::
-   For a single-file C++/CUDA, a convenient method :py:func:`tvm_ffi.cpp.load_inline`
-   is provided to minimize boilerplate code in compilation, linking and loading.
+   For a single-file C++/CUDA project, a convenient method :py:func:`tvm_ffi.cpp.load_inline`
+   is provided to minimize boilerplate code in compilation, linking, and loading.
-Note that ``libmain.so`` is neutral and agnostic to:
+The resulting ``add_one_cpu.so`` and ``add_one_cuda.so`` are minimal libraries that are agnostic to:
- Python version/ABI, because it is pure C++ and not compiled or linked against Python
- C++ ABI, because TVM-FFI interacts with the artifact only via stable C APIs
-- Frontend languages, which can be C++, Rust, Python, TypeScript, etc.
+- Languages, which can be C++, Rust or Python.
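
Concretely, the language neutrality comes from the fact that each exported function is reachable through a
plain C symbol (the same declaration reappears in the bundling example later in this guide); the parameter
names below are illustrative, not part of the exported signature:

.. code-block:: cpp

   // The exported entry point is a plain C function following the TVM FFI ABI;
   // no C++ name mangling or C++ types cross the library boundary.
   extern "C" int __tvm_ffi_add_one(void* closure_handle, const TVMFFIAny* args,
                                    int32_t num_args, TVMFFIAny* result);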
.. _sec-use-across-framework:
Ship Across ML Frameworks
-------------------------
-TVM FFI's Python package provides :py:func:`tvm_ffi.load_module`, which loads either C++ or CUDA's ``libmain.so`` into :py:class:`tvm_ffi.Module`.
+TVM-FFI's Python package provides :py:func:`tvm_ffi.load_module`, which can load either
+the ``add_one_cpu.so`` or ``add_one_cuda.so`` into :py:class:`tvm_ffi.Module`.
.. code-block:: python
import tvm_ffi
- mod : tvm_ffi.Module = tvm_ffi.load_module("libmain.so")
- func : tvm_ffi.Function = mod.add_one
+ mod : tvm_ffi.Module = tvm_ffi.load_module("add_one_cpu.so")
+ func : tvm_ffi.Function = mod.add_one_cpu
+
+``mod.add_one_cpu`` retrieves a callable :py:class:`tvm_ffi.Function` that accepts tensors from host frameworks
+directly, which can be zero-copy incorporated into all popular ML frameworks. This process is done seamlessly
+without any boilerplate code and with extremely low latency.
+We can then use these functions in the following ways:
-``mod["add_one"]`` retrieves a callable :py:class:`tvm_ffi.Function` that accepts tensors from host frameworks directly, which can be zero-copy incorporated in all popular ML frameworks. This process is done seamlessly without any boilerplate code, and with ultra low latency.
.. tab-set::
- .. tab-item:: PyTorch (C++/CUDA)
+ .. tab-item:: PyTorch
.. code-block:: python
import torch
- device = "cpu" # or "cuda"
+        # CPU also works: change the module to add_one_cpu.so and the device to "cpu"
+ mod = tvm_ffi.load_module("add_one_cuda.so")
+ device = "cuda"
x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, device=device)
y = torch.empty_like(x)
- func(x, y)
+ mod.add_one_cuda(x, y)
print(y)
- .. tab-item:: JAX (C++/CUDA)
-      Upcoming. See `jax-tvm-ffi <https://github.com/nvidia/jax-tvm-ffi>`_ for preview.
+ .. tab-item:: JAX
+
+ Support via `jax-tvm-ffi <https://github.com/nvidia/jax-tvm-ffi>`_
+
+ .. code-block:: python
+
+ import jax
+ import jax.numpy as jnp
+ import jax_tvm_ffi
+ import tvm_ffi
+
+ mod = tvm_ffi.load_module("add_one_cuda.so")
+
+ # Register the function with JAX
+         jax_tvm_ffi.register_ffi_target("add_one_cuda", mod.add_one_cuda, platform="cuda")
+ x = jnp.array([1.0, 2.0, 3.0], dtype=jnp.float32)
+ y = jax.ffi.ffi_call(
+ "add_one_cuda",
+ jax.ShapeDtypeStruct(x.shape, x.dtype),
+ vmap_method="broadcast_all",
+ )(x)
+ print(y)
- .. tab-item:: NumPy (C++)
+ .. tab-item:: NumPy (CPU)
.. code-block:: python
import numpy as np
+
+ mod = tvm_ffi.load_module("add_one_cpu.so")
x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
y = np.empty_like(x)
- func(x, y)
+ mod.add_one_cpu(x, y)
print(y)
.. tab-item:: CuPy (CUDA)
@@ -256,9 +290,11 @@ TVM FFI's Python package provides :py:func:`tvm_ffi.load_module`, which loads ei
.. code-block:: python
import cupy as cp
+
+ mod = tvm_ffi.load_module("add_one_cuda.so")
x = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
y = cp.empty_like(x)
- func(x, y)
+ mod.add_one_cuda(x, y)
print(y)
@@ -266,64 +302,126 @@ Ship Across Languages
---------------------
TVM-FFI's core loading mechanism is ABI stable and works across language boundaries.
-That said, a single artifact can be loaded in every language TVM-FFI supports,
-without having to recompile different artifact targeting different ABIs or languages.
+A single artifact can be loaded in every language TVM-FFI supports,
+without having to recompile different artifacts targeting different ABIs or languages.
Python
~~~~~~
-As shown in the :ref:`previous section<sec-use-across-framework>`, :py:func:`tvm_ffi.load_module` loads a language- and framework-neutral ``libmain.so`` and supports incorporating it into all Python frameworks that implements the standard `DLPack protocol <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
+As shown in the :ref:`previous section<sec-use-across-framework>`, :py:func:`tvm_ffi.load_module` loads a language-
+and framework-independent ``add_one_cpu.so`` or ``add_one_cuda.so`` and can be used to incorporate it into all Python
+array frameworks that implement the standard `DLPack protocol <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
C++
~~~
-TVM-FFI's C++ API :cpp:func:`tvm::ffi::Module::LoadFromFile` loads ``libmain.so`` and can be used directly in C/C++ with no Python dependency. Note that it is also ABI stable and can be used without having to worry about C++ compilers and ABIs.
+TVM-FFI's C++ API :cpp:func:`tvm::ffi::Module::LoadFromFile` loads ``add_one_cpu.so`` or ``add_one_cuda.so`` and
+can be used directly in C/C++ with no Python dependency.
.. code-block:: cpp
- // File: test_load.cc
- #include <tvm/ffi/extra/module.h>
+ // File: run_example.cc
+  #include <tvm/ffi/container/tensor.h>
+  #include <tvm/ffi/extra/module.h>
+  #include <cstdlib>
+  #include <iostream>
+
+ namespace ffi = tvm::ffi;
+ struct CPUNDAlloc {
+    void AllocData(DLTensor* tensor) { tensor->data = malloc(ffi::GetDataSize(*tensor)); }
+ void FreeData(DLTensor* tensor) { free(tensor->data); }
+ };
+
+  inline ffi::Tensor Empty(ffi::Shape shape, DLDataType dtype, DLDevice device) {
+ return ffi::Tensor::FromNDAlloc(CPUNDAlloc(), shape, dtype, device);
+ }
+
+ int main() {
+ // load the module
+ ffi::Module mod = ffi::Module::LoadFromFile("add_one_cpu.so");
+
+    // create a Tensor; alternatively, one can directly pass in a DLTensor*
+    ffi::Tensor x = Empty({5}, DLDataType({kDLFloat, 32, 1}), DLDevice({kDLCPU, 0}));
+ for (int i = 0; i < 5; ++i) {
+ reinterpret_cast<float*>(x.data_ptr())[i] = static_cast<float>(i);
+ }
+
+ ffi::Function add_one_cpu = mod->GetFunction("add_one_cpu").value();
+ add_one_cpu(x, x);
- int main() {
- namespace ffi = tvm::ffi;
- ffi::Module mod = ffi::Module::LoadFromFile("libmain.so");
- ffi::Function func = mod->GetFunction("add_one").value();
- return 0;
- }
+ std::cout << "x after add_one_cpu(x, x)" << std::endl;
+ for (int i = 0; i < 5; ++i) {
+ std::cout << reinterpret_cast<float*>(x.data_ptr())[i] << " ";
+ }
+ std::cout << std::endl;
+ return 0;
+ }
Compile it with:
.. code-block:: bash
g++ -fvisibility=hidden -O3 \
- test_load.cc \
+ run_example.cc \
`tvm-ffi-config --cxxflags` \
`tvm-ffi-config --ldflags` \
`tvm-ffi-config --libs` \
-Wl,-rpath,`tvm-ffi-config --libdir` \
- -o test_load
+ -o run_example
+
+ ./run_example
+
+.. hint::
- ./test_load
+   Sometimes it may be desirable to directly bundle the exported module into the same binary as the main program.
+   In such cases, we can use :cpp:func:`tvm::ffi::Function::FromExternC` to create a
+   :cpp:class:`tvm::ffi::Function` from the exported symbol, or directly use
+   :cpp:func:`tvm::ffi::Function::InvokeExternC` to invoke the function. This feature can be useful
+   when the exported module is generated by another DSL compiler matching the ABI.
+
+ .. code-block:: cpp
+
+ // File: test_bundle.cc, link with libmain.o
+ #include <tvm/ffi/function.h>
+ #include <tvm/ffi/container/tensor.h>
+ // declare reference to the exported symbol
+      extern "C" int __tvm_ffi_add_one(void*, const TVMFFIAny*, int32_t, TVMFFIAny*);
+
+ namespace ffi = tvm::ffi;
+
+ int bundle_add_one(ffi::TensorView x, ffi::TensorView y) {
+ void* closure_handle = nullptr;
+ ffi::Function::InvokeExternC(closure_handle, __tvm_ffi_add_one, x, y);
+ return 0;
+ }
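
   A sketch of the ``FromExternC`` variant is shown below. The argument order here (closure handle, then
   the exported C symbol) is an assumption made to mirror the ``InvokeExternC`` call above; check the API
   reference for the exact signature.

   .. code-block:: cpp

      // Sketch: wrap the exported C symbol once into a reusable ffi::Function.
      // The FromExternC argument order is assumed to mirror InvokeExternC above.
      ffi::Function add_one = ffi::Function::FromExternC(/*closure_handle=*/nullptr, __tvm_ffi_add_one);
      // The wrapped function can then be called like any other ffi::Function,
      // e.g. add_one(x, y) with the TensorView arguments from bundle_add_one.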
Rust
~~~~
-TVM-FFI's Rust API ``tvm_ffi::Module::load_from_file`` loads ``libmain.so``, and then retrieves a function ``add_one`` from it. This procedure is strictly identical to C++ and Python:
+TVM-FFI's Rust API ``tvm_ffi::Module::load_from_file`` loads ``add_one_cpu.so`` or ``add_one_cuda.so`` and
+then retrieves a function ``add_one_cpu`` or ``add_one_cuda`` from it.
+This procedure is identical to those in C++ and Python:
.. code-block:: rust
- fn load_add_one() -> Result<tvm_ffi::Function> {
-    let module: tvm_ffi::Module = tvm_ffi::Module::load_from_file("libmain.so")?;
- let result: tvm_ffi::Function = module.get_function("add_one")?;
- Ok(result)
+ fn run_add_one(x: &Tensor, y: &Tensor) -> Result<()> {
+ let module = tvm_ffi::Module::load_from_file("add_one_cpu.so")?;
+        let func = module.get_function("add_one_cpu")?;
+        let typed_fn = into_typed_fn!(func, Fn(&Tensor, &Tensor) -> Result<()>);
+ typed_fn(x, y)?;
+ Ok(())
}
+.. hint::
+
+   We can also use the Rust API to target the TVM FFI ABI. This means we can use Rust to write the function
+   implementation and export it to Python/C++ in the same fashion.
+
+
Troubleshooting
---------------
- ``OSError: cannot open shared object file``: Add an rpath (Linux/macOS) or ensure the DLL is on ``PATH`` (Windows). Example run-path: ``-Wl,-rpath,`tvm-ffi-config --libdir```.
- ``undefined symbol: __tvm_ffi_add_one``: Ensure you used ``TVM_FFI_DLL_EXPORT_TYPED_FUNC`` and compiled with default symbol visibility (``-fvisibility=hidden`` is fine; the macro ensures export).
-- ``CUDA error: invalid device function``: Rebuild with the right ``-arch=sm_XX`` for your GPU, or include multiple ``-gencode`` entries.
+- ``CUDA error: invalid device function``: Rebuild with the correct ``-arch=sm_XX`` for your GPU, or include multiple ``-gencode`` entries.
diff --git a/examples/quick_start/CMakeLists.txt b/examples/quick_start/CMakeLists.txt
index b16ab48..0ba983d 100644
--- a/examples/quick_start/CMakeLists.txt
+++ b/examples/quick_start/CMakeLists.txt
@@ -25,7 +25,13 @@ find_package(
REQUIRED
)
-# Find tvm-ffi automatically
+# Run `tvm_ffi.config --cmakedir` to find tvm-ffi targets
+execute_process(
+ COMMAND "${Python_EXECUTABLE}" -m tvm_ffi.config --cmakedir
+ OUTPUT_STRIP_TRAILING_WHITESPACE
+ OUTPUT_VARIABLE tvm_ffi_ROOT
+)
+
find_package(tvm_ffi CONFIG REQUIRED)
# Build the CPU and C versions of the simple "add one" function that the examples call.