junrushao commented on code in PR #431: URL: https://github.com/apache/tvm-ffi/pull/431#discussion_r2775066389
########## docs/guides/kernel_library_guide.rst: ########## @@ -15,180 +15,343 @@ .. specific language governing permissions and limitations .. under the License. -==================== Kernel Library Guide ==================== -This guide serves as a quick start for shipping kernel libraries with TVM FFI. The shipped kernel libraries are of python version and ML framework agnostic. With the help of TVM FFI, we can connect the kernel libraries to multiple ML framework, such as PyTorch, XLA, JAX, together with the minimal efforts. +This guide covers shipping C++/CUDA kernel libraries with TVM-FFI. The resulting +libraries are agnostic to Python version and ML framework — a single ``.so`` works +with PyTorch, JAX, PaddlePaddle, NumPy, and more. -Tensor -====== +.. seealso:: -Almost all kernel libraries are about tensor computation and manipulation. For better adaptation to different ML frameworks, TVM FFI provides a minimal set of data structures to represent tensors from ML frameworks, including the tensor basic attributes and storage pointer. -To be specific, in TVM FFI, two types of tensor constructs, :cpp:class:`~tvm::ffi::Tensor` and :cpp:class:`~tvm::ffi::TensorView`, can be used to represent a tensor from ML frameworks. + - :doc:`../get_started/quickstart`: End-to-end walkthrough of a simpler ``add_one`` kernel + - :doc:`../packaging/cpp_tooling`: Build toolchain, CMake integration, and library distribution + - All example code in this guide is under + `examples/kernel_library/ <https://github.com/apache/tvm-ffi/tree/main/examples/kernel_library>`_. -Tensor and TensorView ---------------------- -Both :cpp:class:`~tvm::ffi::Tensor` and :cpp:class:`~tvm::ffi::TensorView` are designed to represent tensors from ML frameworks that interact with the TVM FFI ABI. They are backed by the `DLTensor` in DLPack in practice. The main difference is whether it is an owning tensor structure. +Anatomy of a Kernel Function +----------------------------- -:cpp:class:`tvm::ffi::Tensor` - :cpp:class:`~tvm::ffi::Tensor` is a completely owning tensor with reference counting. It can be created on either C++ or Python side and passed between either side. And TVM FFI internally keeps a reference count to track lifetime of the tensors. When the reference count goes to zero, its underlying deleter function will be called to free the tensor storage. +Every TVM-FFI CUDA kernel follows the same sequence: -:cpp:class:`tvm::ffi::TensorView` - :cpp:class:`~tvm::ffi::TensorView` is a non-owning view of an existing tensor, pointing to an existing tensor (e.g., a tensor allocated by PyTorch). +1. **Validate** inputs (device, dtype, shape, contiguity) +2. **Set device guard** to match the tensor's device +3. **Acquire stream** from the host framework +4. **Dispatch** on dtype and **launch** the kernel -It is **recommended** to use :cpp:class:`~tvm::ffi::TensorView` when possible, that helps us to support more cases, including cases where only view but not strong reference are passed, like XLA buffer. It is also more lightweight. However, since :cpp:class:`~tvm::ffi::TensorView` is a non-owning view, it is the user's responsibility to ensure the lifetime of underlying tensor. +Here is a complete ``Scale`` kernel that computes ``y = x * factor``: -Tensor Attributes ------------------ +.. literalinclude:: ../../examples/kernel_library/scale_kernel.cu + :language: cpp + :start-after: [function.begin] + :end-before: [function.end] -For convenience, :cpp:class:`~tvm::ffi::TensorView` and :cpp:class:`~tvm::ffi::Tensor` align the following attributes retrieval mehtods to :cpp:class:`at::Tensor` interface, to obtain tensor basic attributes and storage pointer: -``dim``, ``dtype``, ``sizes``, ``size``, ``strides``, ``stride``, ``numel``, ``data_ptr``, ``device``, ``is_contiguous`` +The CUDA kernel itself is a standard ``__global__`` function: -Please refer to the documentation of both tensor classes for their details. Here highlight some non-primitive attributes: +.. literalinclude:: ../../examples/kernel_library/scale_kernel.cu + :language: cpp + :start-after: [cuda_kernel.begin] + :end-before: [cuda_kernel.end] -:c:struct:`DLDataType` - The ``dtype`` of the tensor. It's represented by a struct with three fields: code, bits, and lanes, defined by DLPack protocol. +The following subsections break down each step. -:c:struct:`DLDevice` - The ``device`` where the tensor is stored. It is represented by a struct with two fields: device_type and device_id, defined by DLPack protocol. -:cpp:class:`tvm::ffi::ShapeView` - The ``sizes`` and ``strides`` attributes retrieval are returned as :cpp:class:`~tvm::ffi::ShapeView`. It is an iterate-able data structure storing the shapes or strides data as ``int64_t`` array. +Input Validation +~~~~~~~~~~~~~~~~ -Tensor Allocation ------------------ +Kernel functions should validate inputs early and fail with clear error messages. +A common pattern is to define reusable ``CHECK_*`` macros on top of +:c:macro:`TVM_FFI_CHECK` (see :doc:`../concepts/exception_handling`): + +.. literalinclude:: ../../examples/kernel_library/tvm_ffi_utils.h + :language: cpp + :start-after: [check_macros.begin] + :end-before: [check_macros.end] + +For **user-facing errors** (bad arguments, unsupported dtypes, shape mismatches), +use :c:macro:`TVM_FFI_THROW` or :c:macro:`TVM_FFI_CHECK` with a specific error kind +so that callers receive an actionable message: + +.. code-block:: cpp + + TVM_FFI_THROW(TypeError) << "Unsupported dtype: " << input.dtype(); + TVM_FFI_CHECK(input.numel() > 0, ValueError) << "input must be non-empty"; + TVM_FFI_CHECK(input.numel() == output.numel(), ValueError) << "size mismatch"; + +For **internal invariants** that indicate bugs in the kernel itself, use +:c:macro:`TVM_FFI_ICHECK`: + +.. code-block:: cpp + + TVM_FFI_ICHECK_GE(n, 0) << "element count must be non-negative"; + + +Device Guard and Stream +~~~~~~~~~~~~~~~~~~~~~~~ + +Before launching a CUDA kernel, two things must happen: + +1. **Set the CUDA device** to match the tensor's device. :cpp:class:`tvm::ffi::CUDADeviceGuard` + is an RAII guard that calls ``cudaSetDevice`` on construction and restores the + original device on destruction. + +2. **Acquire the stream** from the host framework via :cpp:func:`TVMFFIEnvGetStream`. + When Python code calls a kernel with PyTorch tensors, TVM-FFI automatically + captures PyTorch's current stream for the tensor's device. + +A small helper keeps this concise: + +.. literalinclude:: ../../examples/kernel_library/tvm_ffi_utils.h + :language: cpp + :start-after: [get_stream.begin] + :end-before: [get_stream.end] + +Every kernel function then follows the same two-line pattern: + +.. code-block:: cpp + + ffi::CUDADeviceGuard guard(input.device().device_id); + cudaStream_t stream = get_cuda_stream(input.device()); + +See :doc:`../concepts/tensor` for details on stream handling and automatic stream +context updates. -TVM FFI provides several methods to create or allocate tensors at C++ runtime. Generally, there are two types of tensor creation methods: -* Allocate a tensor with new storage from scratch, i.e. :cpp:func:`~tvm::ffi::Tensor::FromEnvAlloc` and :cpp:func:`~tvm::ffi::Tensor::FromNDAlloc`. By this types of methods, the shapes, strides, data types, devices and other attributes are required for the allocation. -* Create a tensor with existing storage following DLPack protocol, i.e. :cpp:func:`~tvm::ffi::Tensor::FromDLPack` and :cpp:func:`~tvm::ffi::Tensor::FromDLPackVersioned`. By this types of methods, the shapes, data types, devices and other attributes can be inferred from the DLPack attributes. +Dtype Dispatch +~~~~~~~~~~~~~~ -FromEnvAlloc -^^^^^^^^^^^^ +Kernels typically support multiple dtypes. Dispatch on :c:struct:`DLDataType` at +runtime while instantiating templates at compile time: -To better adapt to the ML framework, it is **recommended** to reuse the framework tensor allocator anyway, instead of directly allocating the tensors via CUDA runtime API, like ``cudaMalloc``. Since reusing the framework tensor allocator: +.. code-block:: cpp -* Benefit from the framework's native caching allocator or related allocation mechanism. -* Help framework tracking memory usage and planning globally. + constexpr DLDataType dl_float32 = DLDataType{kDLFloat, 32, 1}; + constexpr DLDataType dl_float16 = DLDataType{kDLFloat, 16, 1}; -TVM FFI provides :cpp:func:`tvm::ffi::Tensor::FromEnvAlloc` to allocate a tensor with the framework tensor allocator. To determine which framework tensor allocator, TVM FFI infers it from the passed-in framework tensors. For example, when calling the kernel library at Python side, there is an input framework tensor if of type ``torch.Tensor``, TVM FFI will automatically bind the :cpp:func:`at::empty` as the current framework tensor allocator by ``TVMFFIEnvTensorAlloc``. And then the :cpp:func:`~tvm::ffi::Tensor::FromEnvAlloc` is calling the :cpp:class:`at::empty` actually: + if (input.dtype() == dl_float32) { + ScaleKernel<<<blocks, threads, 0, stream>>>( + static_cast<float*>(output.data_ptr()), ...); + } else if (input.dtype() == dl_float16) { + ScaleKernel<<<blocks, threads, 0, stream>>>( + static_cast<half*>(output.data_ptr()), ...); + } else { + TVM_FFI_THROW(TypeError) << "Unsupported dtype: " << input.dtype(); + } -.. code-block:: c++ +For libraries that support many dtypes, define dispatch macros +(see `FlashInfer's tvm_ffi_utils.h <https://github.com/flashinfer-ai/flashinfer/blob/main/csrc/tvm_ffi_utils.h>`_ +for a production example). - ffi::Tensor tensor = ffi::Tensor::FromEnvAlloc(TVMFFIEnvTensorAlloc, ...); -which is equivalent to: +Export and Load +--------------- -.. code-block:: c++ +Export and Build +~~~~~~~~~~~~~~~~ - at::Tensor tensor = at::empty(...); +**Export.** Use :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` to create a C symbol +that follows the :doc:`TVM-FFI calling convention <../concepts/func_module>`: -FromNDAlloc -^^^^^^^^^^^ +.. literalinclude:: ../../examples/kernel_library/scale_kernel.cu + :language: cpp + :start-after: [export.begin] + :end-before: [export.end] -:cpp:func:`tvm::ffi::Tensor::FromNDAlloc` can be used to create a tensor with custom memory allocator. It is of simple usage by providing a custom memory allocator and deleter for tensor allocation and free each, rather than relying on any framework tensor allocator. +This creates a symbol ``__tvm_ffi_scale`` in the shared library. For larger +projects, a common pattern is to keep kernel implementations and export +declarations in **separate files** — an implementation file and a binding file: Review Comment: Opus 4.6 hallucinates 🫠 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
