junrushao commented on code in PR #431:
URL: https://github.com/apache/tvm-ffi/pull/431#discussion_r2775066389


##########
docs/guides/kernel_library_guide.rst:
##########
@@ -15,180 +15,343 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-====================
 Kernel Library Guide
 ====================
 
-This guide serves as a quick start for shipping kernel libraries with TVM FFI. The shipped kernel libraries are of python version and ML framework agnostic. With the help of TVM FFI, we can connect the kernel libraries to multiple ML framework, such as PyTorch, XLA, JAX, together with the minimal efforts.
+This guide covers shipping C++/CUDA kernel libraries with TVM-FFI. The resulting
+libraries are agnostic to Python version and ML framework — a single ``.so`` works
+with PyTorch, JAX, PaddlePaddle, NumPy, and more.
 
-Tensor
-======
+.. seealso::
 
-Almost all kernel libraries are about tensor computation and manipulation. For better adaptation to different ML frameworks, TVM FFI provides a minimal set of data structures to represent tensors from ML frameworks, including the tensor basic attributes and storage pointer.
-To be specific, in TVM FFI, two types of tensor constructs, :cpp:class:`~tvm::ffi::Tensor` and :cpp:class:`~tvm::ffi::TensorView`, can be used to represent a tensor from ML frameworks.
+   - :doc:`../get_started/quickstart`: End-to-end walkthrough of a simpler ``add_one`` kernel
+   - :doc:`../packaging/cpp_tooling`: Build toolchain, CMake integration, and library distribution
+   - All example code in this guide is under
+     `examples/kernel_library/ <https://github.com/apache/tvm-ffi/tree/main/examples/kernel_library>`_.
 
-Tensor and TensorView
----------------------
 
-Both :cpp:class:`~tvm::ffi::Tensor` and :cpp:class:`~tvm::ffi::TensorView` are designed to represent tensors from ML frameworks that interact with the TVM FFI ABI. They are backed by the `DLTensor` in DLPack in practice. The main difference is whether it is an owning tensor structure.
+Anatomy of a Kernel Function
+-----------------------------
 
-:cpp:class:`tvm::ffi::Tensor`
- :cpp:class:`~tvm::ffi::Tensor` is a completely owning tensor with reference counting. It can be created on either C++ or Python side and passed between either side. And TVM FFI internally keeps a reference count to track lifetime of the tensors. When the reference count goes to zero, its underlying deleter function will be called to free the tensor storage.
+Every TVM-FFI CUDA kernel follows the same sequence:
 
-:cpp:class:`tvm::ffi::TensorView`
- :cpp:class:`~tvm::ffi::TensorView` is a non-owning view of an existing tensor, pointing to an existing tensor (e.g., a tensor allocated by PyTorch).
+1. **Validate** inputs (device, dtype, shape, contiguity)
+2. **Set device guard** to match the tensor's device
+3. **Acquire stream** from the host framework
+4. **Dispatch** on dtype and **launch** the kernel
 
-It is **recommended** to use :cpp:class:`~tvm::ffi::TensorView` when possible, that helps us to support more cases, including cases where only view but not strong reference are passed, like XLA buffer. It is also more lightweight. However, since :cpp:class:`~tvm::ffi::TensorView` is a non-owning view, it is the user's responsibility to ensure the lifetime of underlying tensor.
+Here is a complete ``Scale`` kernel that computes ``y = x * factor``:
 
-Tensor Attributes
------------------
+.. literalinclude:: ../../examples/kernel_library/scale_kernel.cu
+   :language: cpp
+   :start-after: [function.begin]
+   :end-before: [function.end]
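
The literalinclude above pulls in the full implementation from the example file. As a rough sketch of the same flow (the names ``Scale``, ``ScaleKernel``, ``get_cuda_stream``, and the ``ffi`` namespace alias follow this guide's examples; everything else is an illustrative assumption, not the file's exact contents):

.. code-block:: cpp

   namespace ffi = tvm::ffi;

   // Sketch only: y = x * factor, mirroring the four steps listed above.
   void Scale(ffi::TensorView x, ffi::TensorView y, double factor) {
     // 1. Validate inputs with user-facing errors.
     TVM_FFI_CHECK(x.numel() == y.numel(), ValueError) << "size mismatch";
     // 2. RAII guard so the launch targets the tensor's device.
     ffi::CUDADeviceGuard guard(x.device().device_id);
     // 3. Pick up the framework's current stream (helper shown later in this guide).
     cudaStream_t stream = get_cuda_stream(x.device());
     // 4. Dispatch on dtype and launch the kernel.
     int64_t n = x.numel();
     int threads = 256;
     int blocks = static_cast<int>((n + threads - 1) / threads);
     if (x.dtype() == DLDataType{kDLFloat, 32, 1}) {
       ScaleKernel<<<blocks, threads, 0, stream>>>(
           static_cast<float*>(y.data_ptr()), static_cast<const float*>(x.data_ptr()),
           static_cast<float>(factor), n);
     } else {
       TVM_FFI_THROW(TypeError) << "Unsupported dtype: " << x.dtype();
     }
   }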
 
-For convenience, :cpp:class:`~tvm::ffi::TensorView` and :cpp:class:`~tvm::ffi::Tensor` align the following attributes retrieval mehtods to :cpp:class:`at::Tensor` interface, to obtain tensor basic attributes and storage pointer:
-``dim``, ``dtype``, ``sizes``, ``size``, ``strides``, ``stride``, ``numel``, ``data_ptr``, ``device``, ``is_contiguous``
+The CUDA kernel itself is a standard ``__global__`` function:
 
-Please refer to the documentation of both tensor classes for their details. Here  highlight some non-primitive attributes:
+.. literalinclude:: ../../examples/kernel_library/scale_kernel.cu
+   :language: cpp
+   :start-after: [cuda_kernel.begin]
+   :end-before: [cuda_kernel.end]
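
The kernel body itself is ordinary CUDA; a minimal sketch of such a ``__global__`` function (its exact signature in the example file is an assumption here) looks like:

.. code-block:: cpp

   // Sketch: one thread per element, templated on the element type
   // (half support assumes <cuda_fp16.h> is included).
   template <typename T>
   __global__ void ScaleKernel(T* out, const T* in, float factor, int64_t n) {
     int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
     if (i < n) {
       out[i] = static_cast<T>(static_cast<float>(in[i]) * factor);
     }
   }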
 
-:c:struct:`DLDataType`
- The ``dtype`` of the tensor. It's represented by a struct with three fields: code, bits, and lanes, defined by DLPack protocol.
+The following subsections break down each step.
 
-:c:struct:`DLDevice`
- The ``device`` where the tensor is stored. It is represented by a struct with two fields: device_type and device_id, defined by DLPack protocol.
 
-:cpp:class:`tvm::ffi::ShapeView`
- The ``sizes`` and ``strides`` attributes retrieval are returned as :cpp:class:`~tvm::ffi::ShapeView`. It is an iterate-able data structure storing the shapes or strides data as ``int64_t`` array.
+Input Validation
+~~~~~~~~~~~~~~~~
 
-Tensor Allocation
------------------
+Kernel functions should validate inputs early and fail with clear error messages.
+A common pattern is to define reusable ``CHECK_*`` macros on top of
+:c:macro:`TVM_FFI_CHECK` (see :doc:`../concepts/exception_handling`):
+
+.. literalinclude:: ../../examples/kernel_library/tvm_ffi_utils.h
+   :language: cpp
+   :start-after: [check_macros.begin]
+   :end-before: [check_macros.end]
+
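The included header carries the example's actual macros; purely as an illustration of the pattern, such helpers might look like this (the macro names ``CHECK_CUDA`` and ``CHECK_CONTIGUOUS`` are hypothetical; only :c:macro:`TVM_FFI_CHECK` and the tensor accessors come from TVM-FFI):

.. code-block:: cpp

   // Hypothetical reusable checks built on TVM_FFI_CHECK.
   #define CHECK_CUDA(t) \
     TVM_FFI_CHECK((t).device().device_type == kDLCUDA, ValueError) \
         << #t << " must be a CUDA tensor"
   #define CHECK_CONTIGUOUS(t) \
     TVM_FFI_CHECK((t).is_contiguous(), ValueError) << #t << " must be contiguous"

A kernel body can then open with ``CHECK_CUDA(input); CHECK_CONTIGUOUS(input);`` before touching the data.
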
+For **user-facing errors** (bad arguments, unsupported dtypes, shape mismatches),
+use :c:macro:`TVM_FFI_THROW` or :c:macro:`TVM_FFI_CHECK` with a specific error kind
+so that callers receive an actionable message:
+
+.. code-block:: cpp
+
+   TVM_FFI_THROW(TypeError) << "Unsupported dtype: " << input.dtype();
+   TVM_FFI_CHECK(input.numel() > 0, ValueError) << "input must be non-empty";
+   TVM_FFI_CHECK(input.numel() == output.numel(), ValueError) << "size mismatch";
+
+For **internal invariants** that indicate bugs in the kernel itself, use
+:c:macro:`TVM_FFI_ICHECK`:
+
+.. code-block:: cpp
+
+   TVM_FFI_ICHECK_GE(n, 0) << "element count must be non-negative";
+
+
+Device Guard and Stream
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Before launching a CUDA kernel, two things must happen:
+
+1. **Set the CUDA device** to match the tensor's device. :cpp:class:`tvm::ffi::CUDADeviceGuard`
+   is an RAII guard that calls ``cudaSetDevice`` on construction and restores the
+   original device on destruction.
+
+2. **Acquire the stream** from the host framework via :cpp:func:`TVMFFIEnvGetStream`.
+   When Python code calls a kernel with PyTorch tensors, TVM-FFI automatically
+   captures PyTorch's current stream for the tensor's device.
+
+A small helper keeps this concise:
+
+.. literalinclude:: ../../examples/kernel_library/tvm_ffi_utils.h
+   :language: cpp
+   :start-after: [get_stream.begin]
+   :end-before: [get_stream.end]
+
+Every kernel function then follows the same two-line pattern:
+
+.. code-block:: cpp
+
+   ffi::CUDADeviceGuard guard(input.device().device_id);
+   cudaStream_t stream = get_cuda_stream(input.device());
+
+See :doc:`../concepts/tensor` for details on stream handling and automatic stream
+context updates.
 
-TVM FFI provides several methods to create or allocate tensors at C++ runtime. Generally, there are two types of tensor creation methods:
 
-* Allocate a tensor with new storage from scratch, i.e. :cpp:func:`~tvm::ffi::Tensor::FromEnvAlloc` and :cpp:func:`~tvm::ffi::Tensor::FromNDAlloc`. By this types of methods, the shapes, strides, data types, devices and other attributes are required for the allocation.
-* Create a tensor with existing storage following DLPack protocol, i.e. :cpp:func:`~tvm::ffi::Tensor::FromDLPack` and :cpp:func:`~tvm::ffi::Tensor::FromDLPackVersioned`. By this types of methods, the shapes, data types, devices and other attributes can be inferred from the DLPack attributes.
+Dtype Dispatch
+~~~~~~~~~~~~~~
 
-FromEnvAlloc
-^^^^^^^^^^^^
+Kernels typically support multiple dtypes. Dispatch on :c:struct:`DLDataType` at
+runtime while instantiating templates at compile time:
 
-To better adapt to the ML framework, it is **recommended** to reuse the framework tensor allocator anyway, instead of directly allocating the tensors via CUDA runtime API, like ``cudaMalloc``. Since reusing the framework tensor allocator:
+.. code-block:: cpp
 
-* Benefit from the framework's native caching allocator or related allocation mechanism.
-* Help framework tracking memory usage and planning globally.
+   constexpr DLDataType dl_float32 = DLDataType{kDLFloat, 32, 1};
+   constexpr DLDataType dl_float16 = DLDataType{kDLFloat, 16, 1};
 
-TVM FFI provides :cpp:func:`tvm::ffi::Tensor::FromEnvAlloc` to allocate a tensor with the framework tensor allocator. To determine which framework tensor allocator, TVM FFI infers it from the passed-in framework tensors. For example, when calling the kernel library at Python side, there is an input framework tensor if of type ``torch.Tensor``, TVM FFI will automatically bind the :cpp:func:`at::empty` as the current framework tensor allocator by ``TVMFFIEnvTensorAlloc``. And then the :cpp:func:`~tvm::ffi::Tensor::FromEnvAlloc` is calling the :cpp:class:`at::empty` actually:
+   if (input.dtype() == dl_float32) {
+     ScaleKernel<<<blocks, threads, 0, stream>>>(
+         static_cast<float*>(output.data_ptr()), ...);
+   } else if (input.dtype() == dl_float16) {
+     ScaleKernel<<<blocks, threads, 0, stream>>>(
+         static_cast<half*>(output.data_ptr()), ...);
+   } else {
+     TVM_FFI_THROW(TypeError) << "Unsupported dtype: " << input.dtype();
+   }
 
-.. code-block:: c++
+For libraries that support many dtypes, define dispatch macros
+(see `FlashInfer's tvm_ffi_utils.h <https://github.com/flashinfer-ai/flashinfer/blob/main/csrc/tvm_ffi_utils.h>`_
+for a production example).
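
Purely as an illustration, such a macro might look roughly like the following (the name ``DISPATCH_FLOAT_DTYPE`` is hypothetical and the surrounding variables match the snippet above):

.. code-block:: cpp

   // Hypothetical dispatch macro: binds a type alias and expands the body once
   // per supported dtype (half assumes <cuda_fp16.h>).
   #define DISPATCH_FLOAT_DTYPE(dtype, c_type, ...)                  \
     if ((dtype) == DLDataType{kDLFloat, 32, 1}) {                   \
       using c_type = float;                                         \
       __VA_ARGS__                                                   \
     } else if ((dtype) == DLDataType{kDLFloat, 16, 1}) {            \
       using c_type = half;                                          \
       __VA_ARGS__                                                   \
     } else {                                                        \
       TVM_FFI_THROW(TypeError) << "Unsupported dtype: " << (dtype); \
     }

   // Usage: the launch is written once and instantiated per dtype.
   DISPATCH_FLOAT_DTYPE(input.dtype(), scalar_t, {
     ScaleKernel<<<blocks, threads, 0, stream>>>(
         static_cast<scalar_t*>(output.data_ptr()),
         static_cast<const scalar_t*>(input.data_ptr()), factor, n);
   });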
 
- ffi::Tensor tensor = ffi::Tensor::FromEnvAlloc(TVMFFIEnvTensorAlloc, ...);
 
-which is equivalent to:
+Export and Load
+---------------
 
-.. code-block:: c++
+Export and Build
+~~~~~~~~~~~~~~~~
 
- at::Tensor tensor = at::empty(...);
+**Export.** Use :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` to create a C symbol
+that follows the :doc:`TVM-FFI calling convention <../concepts/func_module>`:
 
-FromNDAlloc
-^^^^^^^^^^^
+.. literalinclude:: ../../examples/kernel_library/scale_kernel.cu
+   :language: cpp
+   :start-after: [export.begin]
+   :end-before: [export.end]
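
In the example this boils down to a single macro invocation along these lines (assuming the C++ implementation function is named ``Scale``):

.. code-block:: cpp

   // Registers the typed function under the exported name "scale".
   TVM_FFI_DLL_EXPORT_TYPED_FUNC(scale, Scale);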
 
-:cpp:func:`tvm::ffi::Tensor::FromNDAlloc` can be used to create a tensor with custom memory allocator. It is of simple usage by providing a custom memory allocator and deleter for tensor allocation and free each, rather than relying on any framework tensor allocator.
+This creates a symbol ``__tvm_ffi_scale`` in the shared library. For larger
+projects, a common pattern is to keep kernel implementations and export
+declarations in **separate files** — an implementation file and a binding file:

Review Comment:
   Opus 4.6 hallucinates 🫠




