This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git


The following commit(s) were added to refs/heads/main by this push:
     new 1fb25db  [DOCS] Polish quickstart (#168)
1fb25db is described below

commit 1fb25dbaf173b53ca7f52af2c10adac827923890
Author: Tianqi Chen <[email protected]>
AuthorDate: Fri Oct 17 21:09:27 2025 -0400

    [DOCS] Polish quickstart (#168)
    
    Updated examples of c++ bundle, fixed the cmake input.
---
 docs/get_started/quickstart.rst     | 226 ++++++++++++++++++++++++++----------
 examples/quick_start/CMakeLists.txt |   8 +-
 2 files changed, 169 insertions(+), 65 deletions(-)

diff --git a/docs/get_started/quickstart.rst b/docs/get_started/quickstart.rst
index 8a03328..3eb02d1 100644
--- a/docs/get_started/quickstart.rst
+++ b/docs/get_started/quickstart.rst
@@ -20,12 +20,12 @@ Quick Start
 
 This guide walks through shipping a minimal ``add_one`` function that computes
 ``y = x + 1`` in C++ and CUDA.
-
-TVM-FFI's Open ABI and FFI makes it possible to **build once, ship everywhere**. That said,
-a single shared library works across:
+TVM-FFI's Open ABI and FFI make it possible to **ship one library** for multiple frameworks and languages.
+We can build a single shared library that works across:
 
 - **ML frameworks**, e.g. PyTorch, JAX, NumPy, CuPy, etc., and
 - **languages**, e.g. C++, Python, Rust, etc.
+- **Python ABI versions**, e.g. ship one wheel to support multiple Python versions, including free-threaded Python.
 
 .. admonition:: Prerequisite
    :class: hint
@@ -34,7 +34,7 @@ a single shared library works across:
    - Python: 3.9 or newer
    - Compiler: C++17-capable toolchain (GCC/Clang/MSVC)
    - Optional ML frameworks for testing: NumPy, PyTorch, JAX, CuPy
-   - CUDA: Any modern version if you want to try the CUDA part
+   - CUDA: Any modern version (if you want to try the CUDA part)
    - TVM-FFI installed via
 
      .. code-block:: bash
@@ -59,7 +59,7 @@ Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x
     .. code-block:: cpp
       :emphasize-lines: 8, 17
 
-      // File: main.cc
+      // File: add_one_cpu.cc
       #include <tvm/ffi/container/tensor.h>
       #include <tvm/ffi/function.h>
 
@@ -75,7 +75,7 @@ Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x
         }
       }
 
-      TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cpp::AddOne);
+      TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cpu, tvm_ffi_example_cpp::AddOne);
       }
 
 
@@ -109,28 +109,28 @@ Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x
         AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
       }
 
-      TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cuda::AddOne);
+      TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cuda, tvm_ffi_example_cuda::AddOne);
       }
 
 
 
-Macro :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` exports the C++ function ``AddOne`` with public name ``add_one`` in the resulting library.
-TVM-FFI looks it up at runtime to make the function available across languages.
+The macro :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` exports the C++ function ``AddOne``
+as a TVM-FFI-compatible symbol named ``add_one_cpu`` or ``add_one_cuda`` in the resulting library.

-Class :cpp:class:`tvm::ffi::TensorView` allows zero-copy interop with tensors from different ML frameworks:
+The class :cpp:class:`tvm::ffi::TensorView` allows zero-copy interop with tensors from different ML frameworks:
 
 - NumPy, CuPy,
 - PyTorch, JAX, or
 - any array type that supports the standard `DLPack protocol <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
 
-Finally, :cpp:func:`TVMFFIEnvGetStream` used in CUDA code makes it possible to launch a kernel on caller's stream.
+Finally, :cpp:func:`TVMFFIEnvGetStream` can be used in the CUDA code to launch a kernel on the caller's stream.
 
 .. _sec-cpp-compile-with-tvm-ffi:
 
 Compile with TVM-FFI
 ~~~~~~~~~~~~~~~~~~~~
 
-**Raw command.** Basic command to compile the source code can be as concise as below:
+**Raw command.** We can use the following minimal commands to compile the source code:
 
 .. tabs::
 
@@ -138,28 +138,29 @@ Compile with TVM-FFI
 
     .. code-block:: bash
 
-      g++ -shared -O3 main.cc                   \
+      g++ -shared -O3 add_one_cpu.cc                   \
           -fPIC -fvisibility=hidden             \
           `tvm-ffi-config --cxxflags`           \
           `tvm-ffi-config --ldflags`            \
           `tvm-ffi-config --libs`               \
-          -o libmain.so
+          -o add_one_cpu.so
 
   .. group-tab:: CUDA
 
     .. code-block:: bash
 
-      nvcc -shared -O3 main.cu                  \
+      nvcc -shared -O3 add_one_cuda.cu                  \
         --compiler-options -fPIC                \
         --compiler-options -fvisibility=hidden  \
         `tvm-ffi-config --cxxflags`             \
         `tvm-ffi-config --ldflags`              \
         `tvm-ffi-config --libs`                 \
-        -o libmain.so
+        -o add_one_cuda.so
 
-This produces a shared library ``libmain.so``. TVM-FFI automatically embeds the metadata needed to call the function across language and framework boundaries.
+This step produces a shared library (``add_one_cpu.so`` or ``add_one_cuda.so``) that can be used across languages and frameworks.

-**CMake.** As the preferred approach to build across platforms, CMake relies on CMake package ``tvm_ffi``, which can be found via ``tvm-ffi-config --cmakedir``.
+**CMake.** As the preferred approach for building across platforms,
+CMake relies on the CMake package ``tvm_ffi``, which can be found via ``tvm-ffi-config --cmakedir``.
 
 .. tabs::
 
@@ -167,88 +168,121 @@ This produces a shared library ``libmain.so``. TVM-FFI automatically embeds the
 
     .. code-block:: cmake
 
-      # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
       find_package(Python COMPONENTS Interpreter REQUIRED)
+      # Run `tvm_ffi.config --cmakedir` to find tvm-ffi targets
       execute_process(
-        COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+        COMMAND "${Python_EXECUTABLE}" -m tvm_ffi.config --cmakedir
         OUTPUT_STRIP_TRAILING_WHITESPACE
         OUTPUT_VARIABLE tvm_ffi_ROOT
       )
       find_package(tvm_ffi CONFIG REQUIRED)
-      # Create C++ target `add_one_cpp`
-      add_library(add_one_cpp SHARED main.cc)
-      target_link_libraries(add_one_cpp PRIVATE tvm_ffi_header)
-      target_link_libraries(add_one_cpp PRIVATE tvm_ffi_shared)
+      # Create C++ target `add_one_cpu`
+      add_library(add_one_cpu SHARED add_one_cpu.cc)
+      target_link_libraries(add_one_cpu PRIVATE tvm_ffi_header)
+      target_link_libraries(add_one_cpu PRIVATE tvm_ffi_shared)
+      # name the built library add_one_cpu.so (instead of libadd_one_cpu.so)
+      set_target_properties(add_one_cpu PROPERTIES PREFIX "" SUFFIX ".so")
 
   .. group-tab:: CUDA
 
     .. code-block:: cmake
 
-      # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
       find_package(Python COMPONENTS Interpreter REQUIRED)
+      # Run `tvm_ffi.config --cmakedir` to find tvm-ffi targets
       execute_process(
-        COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+        COMMAND "${Python_EXECUTABLE}" -m tvm_ffi.config --cmakedir
         OUTPUT_STRIP_TRAILING_WHITESPACE
         OUTPUT_VARIABLE tvm_ffi_ROOT
       )
       find_package(tvm_ffi CONFIG REQUIRED)
       # Create C++ target `add_one_cuda`
       enable_language(CUDA)
-      add_library(add_one_cuda SHARED main.cu)
+      add_library(add_one_cuda SHARED add_one_cuda.cu)
       target_link_libraries(add_one_cuda PRIVATE tvm_ffi_header)
       target_link_libraries(add_one_cuda PRIVATE tvm_ffi_shared)
+      # name the built library add_one_cuda.so (instead of libadd_one_cuda.so)
+      set_target_properties(add_one_cuda PROPERTIES PREFIX "" SUFFIX ".so")
 
 .. hint::
 
-   For a single-file C++/CUDA, a convenient method :py:func:`tvm_ffi.cpp.load_inline`
-   is provided to minimize boilerplate code in compilation, linking and loading.
+   For a single-file C++/CUDA project, a convenient method :py:func:`tvm_ffi.cpp.load_inline`
+   is provided to minimize boilerplate code in compilation, linking, and loading.
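+
+   A minimal sketch of that flow is shown below; the keyword arguments passed to ``load_inline``
+   here are assumptions for illustration only, so check the :py:func:`tvm_ffi.cpp.load_inline`
+   reference for the actual signature.
+
+   .. code-block:: python
+
+      import tvm_ffi.cpp
+
+      # read the C++ source shown above into a string
+      cpp_source = open("add_one_cpu.cc").read()
+
+      # compile, link, and load in one call (argument names are illustrative)
+      mod = tvm_ffi.cpp.load_inline(
+          name="add_one_inline",
+          cpp_sources=cpp_source,
+          functions=["add_one_cpu"],
+      )
+      # afterwards, mod.add_one_cpu(x, y) behaves like the loaded-module examples below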
 
-Note that ``libmain.so`` is neutral and agnostic to:
+The resulting ``add_one_cpu.so`` and ``add_one_cuda.so`` are minimal libraries 
that are agnostic to:
 
 - Python version/ABI, because it is pure C++ and not compiled or linked 
against Python
 - C++ ABI, because TVM-FFI interacts with the artifact only via stable C APIs
-- Frontend languages, which can be C++, Rust, Python, TypeScript, etc.
+- Languages, which can be C++, Rust or Python.
 
 .. _sec-use-across-framework:
 
 Ship Across ML Frameworks
 -------------------------
 
-TVM FFI's Python package provides :py:func:`tvm_ffi.load_module`, which loads either C++ or CUDA's ``libmain.so`` into :py:class:`tvm_ffi.Module`.
+TVM-FFI's Python package provides :py:func:`tvm_ffi.load_module`, which can load either
+``add_one_cpu.so`` or ``add_one_cuda.so`` into a :py:class:`tvm_ffi.Module`.
 
 .. code-block:: python
 
    import tvm_ffi
-   mod  : tvm_ffi.Module   = tvm_ffi.load_module("libmain.so")
-   func : tvm_ffi.Function = mod.add_one
+   mod  : tvm_ffi.Module   = tvm_ffi.load_module("add_one_cpu.so")
+   func : tvm_ffi.Function = mod.add_one_cpu
+
+``mod.add_one_cpu`` retrieves a callable :py:class:`tvm_ffi.Function` that accepts tensors from host frameworks
+directly, and can be incorporated zero-copy into all popular ML frameworks. This happens seamlessly,
+without any boilerplate code and with extremely low latency.
+We can then use these functions in the following ways:

-``mod["add_one"]`` retrieves a callable :py:class:`tvm_ffi.Function` that accepts tensors from host frameworks directly, which can be zero-copy incorporated in all popular ML frameworks. This process is done seamlessly without any boilerplate code, and with ultra low latency.
 
 .. tab-set::
 
-    .. tab-item:: PyTorch (C++/CUDA)
+    .. tab-item:: PyTorch
 
         .. code-block:: python
 
           import torch
-          device = "cpu" # or "cuda"
+          # cpu also works by changing the module to add_one_cpu.so and device to "cpu"
+          mod = tvm_ffi.load_module("add_one_cuda.so")
+          device = "cuda"
           x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, device=device)
           y = torch.empty_like(x)
-          func(x, y)
+          mod.add_one_cuda(x, y)
           print(y)
 
-    .. tab-item:: JAX (C++/CUDA)
 
-        Upcoming. See `jax-tvm-ffi <https://github.com/nvidia/jax-tvm-ffi>`_ for preview.
+    .. tab-item:: JAX
+
+        Supported via `jax-tvm-ffi <https://github.com/nvidia/jax-tvm-ffi>`_:
+
+        .. code-block:: python
+
+          import jax
+          import jax.numpy as jnp
+          import jax_tvm_ffi
+          import tvm_ffi
+
+          mod = tvm_ffi.load_module("add_one_cuda.so")
+
+          # Register the function with JAX
+          jax_tvm_ffi.register_ffi_target("add_one_cuda", mod.add_one_cuda, platform="cuda")
+          x = jnp.array([1.0, 2.0, 3.0], dtype=jnp.float32)
+          y = jax.ffi.ffi_call(
+              "add_one_cuda",
+              jax.ShapeDtypeStruct(x.shape, x.dtype),
+              vmap_method="broadcast_all",
+          )(x)
+          print(y)
 
-    .. tab-item:: NumPy (C++)
+    .. tab-item:: NumPy (CPU)
 
         .. code-block:: python
 
           import numpy as np
+
+          mod = tvm_ffi.load_module("add_one_cpu.so")
           x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
           y = np.empty_like(x)
-          func(x, y)
+          mod.add_one_cpu(x, y)
           print(y)
 
     .. tab-item:: CuPy (CUDA)
@@ -256,9 +290,11 @@ TVM FFI's Python package provides :py:func:`tvm_ffi.load_module`, which loads ei
         .. code-block:: python
 
           import cupy as cp
+
+          mod = tvm_ffi.load_module("add_one_cuda.so")
           x = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
           y = cp.empty_like(x)
-          func(x, y)
+          mod.add_one_cuda(x, y)
           print(y)
 
 
@@ -266,64 +302,126 @@ Ship Across Languages
 ---------------------
 
 TVM-FFI's core loading mechanism is ABI stable and works across language boundaries.
-That said, a single artifact can be loaded in every language TVM-FFI supports,
-without having to recompile different artifact targeting different ABIs or languages.
+A single artifact can be loaded in every language TVM-FFI supports,
+without having to recompile different artifacts targeting different ABIs or languages.
 
 
 Python
 ~~~~~~
 
-As shown in the :ref:`previous section<sec-use-across-framework>`, :py:func:`tvm_ffi.load_module` loads a language- and framework-neutral ``libmain.so`` and supports incorporating it into all Python frameworks that implements the standard `DLPack protocol <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
+As shown in the :ref:`previous section<sec-use-across-framework>`, :py:func:`tvm_ffi.load_module` loads a language-
+and framework-independent ``add_one_cpu.so`` or ``add_one_cuda.so`` and can be used to incorporate it into all Python
+array frameworks that implement the standard `DLPack protocol <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
 
 C++
 ~~~
 
-TVM-FFI's C++ API :cpp:func:`tvm::ffi::Module::LoadFromFile` loads ``libmain.so`` and can be used directly in C/C++ with no Python dependency. Note that it is also ABI stable and can be used without having to worry about C++ compilers and ABIs.
+TVM-FFI's C++ API :cpp:func:`tvm::ffi::Module::LoadFromFile` loads ``add_one_cpu.so`` or ``add_one_cuda.so`` and
+can be used directly in C/C++ with no Python dependency.
 
 .. code-block:: cpp
 
-   // File: test_load.cc
-   #include <tvm/ffi/extra/module.h>
+  // File: run_example.cc
+  #include <tvm/ffi/container/tensor.h>
+  #include <tvm/ffi/extra/module.h>
+  #include <cstdlib>   // malloc/free used by CPUNDAlloc
+  #include <iostream>  // std::cout in main
+
+  namespace ffi = tvm::ffi;
+  struct CPUNDAlloc {
+    void AllocData(DLTensor* tensor) { tensor->data = malloc(ffi::GetDataSize(*tensor)); }
+    void FreeData(DLTensor* tensor) { free(tensor->data); }
+  };
+
+  inline ffi::Tensor Empty(ffi::Shape shape, DLDataType dtype, DLDevice device) {
+    return ffi::Tensor::FromNDAlloc(CPUNDAlloc(), shape, dtype, device);
+  }
+
+  int main() {
+    // load the module
+    ffi::Module mod = ffi::Module::LoadFromFile("add_one_cpu.so");
+
+    // create a Tensor; alternatively, one can directly pass in a DLTensor*
+    ffi::Tensor x = Empty({5}, DLDataType({kDLFloat, 32, 1}), DLDevice({kDLCPU, 0}));
+    for (int i = 0; i < 5; ++i) {
+      reinterpret_cast<float*>(x.data_ptr())[i] = static_cast<float>(i);
+    }
+
+    ffi::Function add_one_cpu = mod->GetFunction("add_one_cpu").value();
+    add_one_cpu(x, x);
 
-   int main() {
-     namespace ffi = tvm::ffi;
-     ffi::Module   mod  = ffi::Module::LoadFromFile("libmain.so");
-     ffi::Function func = mod->GetFunction("add_one").value();
-     return 0;
-   }
+    std::cout << "x after add_one_cpu(x, x)" << std::endl;
+    for (int i = 0; i < 5; ++i) {
+      std::cout << reinterpret_cast<float*>(x.data_ptr())[i] << " ";
+    }
+    std::cout << std::endl;
+    return 0;
+  }
 
 Compile it with:
 
 .. code-block:: bash
 
     g++ -fvisibility=hidden -O3               \
-        test_load.cc                          \
+        run_example.cc                        \
         `tvm-ffi-config --cxxflags`           \
         `tvm-ffi-config --ldflags`            \
         `tvm-ffi-config --libs`               \
         -Wl,-rpath,`tvm-ffi-config --libdir`  \
-        -o test_load
+        -o run_example
+
+    ./run_example
+
+.. hint::
 
-    ./test_load
+  Sometimes it may be desirable to directly bundle the exported module into the same binary as the main program.
+  In such cases, we can use :cpp:func:`tvm::ffi::Function::FromExternC` to create a
+  :cpp:class:`tvm::ffi::Function` from the exported symbol, or directly use
+  :cpp:func:`tvm::ffi::Function::InvokeExternC` to invoke the function. This feature can be useful
+  when the exported module is generated by another DSL compiler matching the ABI.
+
+  .. code-block:: cpp
+
+      // File: test_bundle.cc, link against the object file compiled from add_one_cpu.cc
+      #include <tvm/ffi/function.h>
+      #include <tvm/ffi/container/tensor.h>
 
+      // declare reference to the exported symbol
+      extern "C" int __tvm_ffi_add_one(void*, const TVMFFIAny*, int32_t, 
TVMFFIAny*);
+
+      namespace ffi = tvm::ffi;
+
+      int bundle_add_one(ffi::TensorView x, ffi::TensorView y) {
+        void* closure_handle = nullptr;
+        ffi::Function::InvokeExternC(closure_handle, __tvm_ffi_add_one_cpu, x, y);
+        return 0;
+      }
 
 Rust
 ~~~~
 
-TVM-FFI's Rust API ``tvm_ffi::Module::load_from_file`` loads ``libmain.so``, and then retrieves a function ``add_one`` from it. This procedure is strictly identical to C++ and Python:
+TVM-FFI's Rust API ``tvm_ffi::Module::load_from_file`` loads ``add_one_cpu.so`` or ``add_one_cuda.so`` and
+then retrieves a function ``add_one_cpu`` or ``add_one_cuda`` from it.
+This procedure is the same as in C++ and Python:
 
 .. code-block:: rust
 
-    fn load_add_one() -> Result<tvm_ffi::Function> {
-        let module: tvm_ffi::Module = tvm_ffi::Module::load_from_file("libmain.so")?;
-        let result: tvm_ffi::Function = module.get_function("add_one")?;
-        Ok(result)
+    fn run_add_one(x: &Tensor, y: &Tensor) -> Result<()> {
+        let module = tvm_ffi::Module::load_from_file("add_one_cpu.so")?;
+        let func = module.get_function("add_one_cpu")?;
+        let typed_fn = into_typed_fn!(func, Fn(&Tensor, &Tensor) -> Result<()>);
+        typed_fn(x, y)?;
+        Ok(())
     }
 
 
+.. hint::
+
+    We can also use the Rust API to target the TVM-FFI ABI. This means we can use Rust to write the function
+    implementation and export it to Python/C++ in the same fashion.
+
+
 Troubleshooting
 ---------------
 
 - ``OSError: cannot open shared object file``: Add an rpath (Linux/macOS) or ensure the DLL is on ``PATH`` (Windows). Example run-path: ``-Wl,-rpath,`tvm-ffi-config --libdir```.
 - ``undefined symbol: __tvm_ffi_add_one``: Ensure you used ``TVM_FFI_DLL_EXPORT_TYPED_FUNC`` and compiled with default symbol visibility (``-fvisibility=hidden`` is fine; the macro ensures export).
-- ``CUDA error: invalid device function``: Rebuild with the right ``-arch=sm_XX`` for your GPU, or include multiple ``-gencode`` entries.
+- ``CUDA error: invalid device function``: Rebuild with the correct ``-arch=sm_XX`` for your GPU, or include multiple ``-gencode`` entries (see the example below).
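+
+  For example, a multi-architecture build might look like the following sketch
+  (the ``sm_80``/``sm_90`` values are placeholders; pick the architectures of the GPUs you target):
+
+  .. code-block:: bash
+
+     nvcc -shared -O3 add_one_cuda.cu            \
+       --compiler-options -fPIC                  \
+       --compiler-options -fvisibility=hidden    \
+       -gencode arch=compute_80,code=sm_80       \
+       -gencode arch=compute_90,code=sm_90       \
+       `tvm-ffi-config --cxxflags`               \
+       `tvm-ffi-config --ldflags`                \
+       `tvm-ffi-config --libs`                   \
+       -o add_one_cuda.so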
diff --git a/examples/quick_start/CMakeLists.txt b/examples/quick_start/CMakeLists.txt
index b16ab48..0ba983d 100644
--- a/examples/quick_start/CMakeLists.txt
+++ b/examples/quick_start/CMakeLists.txt
@@ -25,7 +25,13 @@ find_package(
   REQUIRED
 )
 
-# Find tvm-ffi automatically
+# Run `tvm_ffi.config --cmakedir` to find tvm-ffi targets
+execute_process(
+  COMMAND "${Python_EXECUTABLE}" -m tvm_ffi.config --cmakedir
+  OUTPUT_STRIP_TRAILING_WHITESPACE
+  OUTPUT_VARIABLE tvm_ffi_ROOT
+)
+
 find_package(tvm_ffi CONFIG REQUIRED)
 
 # Build the CPU and C versions of the simple "add one" function that the examples call.
