yaoyaoding commented on code in PR #283:
URL: https://github.com/apache/tvm-ffi/pull/283#discussion_r2561705438


##########
docs/guides/cubin_launcher.rst:
##########
@@ -0,0 +1,422 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..   http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+CUBIN Launcher Guide
+====================
+
+This guide demonstrates how to load and launch CUDA kernels from CUBIN (CUDA Binary) modules using TVM-FFI. The CUBIN launcher enables you to execute pre-compiled or runtime-compiled CUDA kernels efficiently through the CUDA Runtime API.
+
+Overview
+--------
+
+TVM-FFI provides utilities for loading and launching CUDA kernels from CUBIN modules. The implementation is in ``tvm/ffi/extra/cuda/cubin_launcher.h`` and provides:
+
+- :cpp:class:`tvm::ffi::CubinModule`: RAII wrapper for loading CUBIN modules from memory
+- :cpp:class:`tvm::ffi::CubinKernel`: Handle for launching CUDA kernels with specified parameters
+- :c:macro:`TVM_FFI_EMBED_CUBIN`: Macro for embedding CUBIN data at compile time
+- :c:macro:`TVM_FFI_EMBED_CUBIN_GET_KERNEL`: Macro for retrieving kernels from embedded CUBIN
+
+The CUBIN launcher supports:
+
+- Loading CUBIN from memory (embedded data or runtime-generated)
+- Multi-GPU execution using CUDA primary contexts
+- Kernel parameter management and launch configuration
+- Integration with NVRTC, Triton, and other CUDA compilation tools
+
+**Build Integration:**
+
+TVM-FFI provides convenient tools for embedding CUBIN data at build time:
+
+- **CMake utilities** (``cmake/Utils/EmbedCubin.cmake``): Functions for compiling CUDA to CUBIN and embedding it into C++ code
+- **Python utility** (``python -m tvm_ffi.utils.embed_cubin``): Command-line tool for embedding CUBIN into object files
+- **Python API** (:py:func:`tvm_ffi.cpp.load_inline`): Runtime embedding via the ``embed_cubin`` parameter
+
+Python Usage
+------------
+
+Basic Workflow
+~~~~~~~~~~~~~~
+
+The typical workflow for launching CUBIN kernels from Python involves four steps (a minimal sketch putting them together follows the list):
+
+1. **Generate CUBIN**: Compile your CUDA kernel to CUBIN format
+2. **Define C++ Wrapper**: Write C++ code to load and launch the kernel
+3. **Load Module**: Use :py:func:`tvm_ffi.cpp.load_inline` with the ``embed_cubin`` parameter
+4. **Call Kernel**: Invoke the kernel function from Python
+
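+The sketch below is illustrative only: the CUBIN is assumed to already exist on disk, the C++ wrapper body is elided, and ``my_kernels``/``add_one`` are placeholder names. Only ``cuda_sources`` and ``embed_cubin`` are parameters described in this guide, so the other argument names (such as ``name``) are assumptions; the complete, runnable versions are the NVRTC and Triton examples below.
+
+.. code-block:: python
+
+   import tvm_ffi.cpp
+
+   # Step 1: obtain CUBIN bytes (from nvcc, NVRTC, Triton, ...)
+   cubin_bytes = open("kernel.cubin", "rb").read()
+
+   # Step 2: a C++ wrapper that loads the embedded CUBIN and launches the kernel
+   wrapper_source = r"""
+   // See the NVRTC/Triton examples below for a complete wrapper that uses
+   // TVM_FFI_EMBED_CUBIN(my_kernels) and TVM_FFI_EMBED_CUBIN_GET_KERNEL(...).
+   """
+
+   # Step 3: load the module; the keys of embed_cubin must match the names
+   # used with TVM_FFI_EMBED_CUBIN inside the wrapper
+   mod = tvm_ffi.cpp.load_inline(
+       name="cubin_demo",
+       cuda_sources=wrapper_source,
+       embed_cubin={"my_kernels": cubin_bytes},
+   )
+
+   # Step 4: call the exported kernel wrapper from Python
+   # mod.add_one(x, y)
+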
+Example: NVRTC Compilation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here's a complete example using NVRTC to compile CUDA source at runtime.
+
+**Step 1: Compile CUDA source to CUBIN using NVRTC**
+
+.. literalinclude:: ../../examples/cubin_launcher/example_nvrtc_cubin.py
+   :language: python
+   :start-after: [cuda_source.begin]
+   :end-before: [cuda_source.end]
+   :dedent: 4
+
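+If you drive NVRTC through the ``cuda-python`` bindings, the compilation step looks roughly like the sketch below. This is a hedged illustration: error handling is omitted, ``cuda_source`` stands for the CUDA source string from Step 1 (the example may use a different variable name), and the module layout of ``cuda-python`` varies between releases. The bundled ``example_nvrtc_cubin.py`` is the authoritative version.
+
+.. code-block:: python
+
+   from cuda import nvrtc  # provided by the cuda-python package
+
+   # Create the program and compile for a concrete architecture (sm_XX) so
+   # that NVRTC emits CUBIN rather than PTX.
+   err, prog = nvrtc.nvrtcCreateProgram(cuda_source.encode(), b"kernel.cu", 0, [], [])
+   opts = [b"--gpu-architecture=sm_80"]
+   err, = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)
+
+   # Retrieve the CUBIN bytes to pass to embed_cubin.
+   err, cubin_size = nvrtc.nvrtcGetCUBINSize(prog)
+   cubin_bytes = b" " * cubin_size
+   err, = nvrtc.nvrtcGetCUBIN(prog, cubin_bytes)
+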
+**Step 2: Define C++ wrapper with embedded CUBIN**
+
+.. literalinclude:: ../../examples/cubin_launcher/example_nvrtc_cubin.py
+   :language: python
+   :start-after: [cpp_wrapper.begin]
+   :end-before: [cpp_wrapper.end]
+   :dedent: 4
+
+**Key Points:**
+
+- The ``embed_cubin`` parameter is a dictionary mapping CUBIN names to their binary data
+- CUBIN names in ``embed_cubin`` must match the names used in :c:macro:`TVM_FFI_EMBED_CUBIN`
+- Use the ``cuda_sources`` parameter (instead of ``cpp_sources``) to automatically link with CUDA libraries
+- The C++ wrapper handles device management, stream handling, and kernel launching
+
+Example: Using Triton Kernels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can compile Triton kernels to CUBIN and launch them through TVM-FFI.
+
+**Step 1: Define and compile Triton kernel**
+
+.. literalinclude:: ../../examples/cubin_launcher/example_triton_cubin.py
+   :language: python
+   :start-after: [triton_kernel.begin]
+   :end-before: [triton_kernel.end]
+   :dedent: 4
+
+**Step 2: Define C++ wrapper to launch the Triton kernel**
+
+.. literalinclude:: ../../examples/cubin_launcher/example_triton_cubin.py
+   :language: python
+   :start-after: [cpp_wrapper.begin]
+   :end-before: [cpp_wrapper.end]
+   :dedent: 4
+
+.. note::
+
+   Triton kernels may require extra dummy parameters in the argument list. Check the compiled kernel's signature to determine the exact parameter count needed; one way to perform that check is sketched below.
+
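+The snippet below inspects the PTX entry declaration of the compiled kernel to see its full parameter list. It assumes ``compiled`` is the result of ``triton.compile`` and that the Triton release in use exposes the generated assembly through the ``.asm`` dictionary.
+
+.. code-block:: python
+
+   # `compiled` is assumed to be a triton.compile(...) result; recent Triton
+   # releases expose the generated PTX and CUBIN via the .asm dictionary.
+   ptx = compiled.asm["ptx"]
+
+   # The .visible .entry declaration lists every parameter the CUBIN expects,
+   # including any extra parameters Triton appends beyond the user-visible ones.
+   entry = ptx[ptx.index(".visible .entry"):]
+   print(entry[:entry.index(")") + 1])
+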
+C++ Usage
+---------
+
+Embedding CUBIN at Compile Time
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The recommended approach in C++ is to embed CUBIN data directly into your shared library:
+
+.. literalinclude:: ../../examples/cubin_launcher/embedded_cubin/src/lib_embedded.cc
+   :language: cpp
+   :start-after: [example.begin]
+   :end-before: [example.end]
+
+**Key Points:**
+
+- Use ``static auto kernel`` to cache the kernel lookup for efficiency
+- Kernel arguments must be pointers to the actual values (use ``&`` to take their addresses)
+- :cpp:type:`tvm::ffi::dim3` supports 1D, 2D, or 3D configurations: ``dim3(x)``, ``dim3(x, y)``, ``dim3(x, y, z)``
+- ``TVMFFIEnvGetStream`` retrieves the correct CUDA stream for the device
+- Always check kernel launch results with :c:macro:`TVM_FFI_CHECK_CUDA_ERROR` (which checks CUDA Runtime API errors); a hypothetical sketch combining these points follows
+
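+The sketch is assembled only from the names documented above: the exact accessor and launch signatures live in ``cubin_launcher.h`` and may differ from what is shown here, and ``my_kernels``/``add_one_kernel`` are placeholder names, so treat ``lib_embedded.cc`` above as the ground truth.
+
+.. code-block:: cpp
+
+   #include <cuda_runtime.h>
+   #include <tvm/ffi/extra/cuda/cubin_launcher.h>
+
+   // Reference a CUBIN blob embedded under the (placeholder) name "my_kernels".
+   TVM_FFI_EMBED_CUBIN(my_kernels);
+
+   void AddOne(float* x, float* y, int n, cudaStream_t stream) {
+     // Cache the kernel lookup so repeated calls skip the module search.
+     static auto kernel = TVM_FFI_EMBED_CUBIN_GET_KERNEL(my_kernels, "add_one_kernel");
+     // Arguments are passed as an array of pointers to the actual values.
+     void* args[] = {&x, &y, &n};
+     // Hypothetical launch call: grid/block use tvm::ffi::dim3 and the result
+     // is checked with TVM_FFI_CHECK_CUDA_ERROR, as the key points describe.
+     TVM_FFI_CHECK_CUDA_ERROR(kernel.Launch(
+         tvm::ffi::dim3((n + 255) / 256), tvm::ffi::dim3(256), args, stream));
+   }
+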
+Loading CUBIN at Runtime
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can also load CUBIN modules dynamically from memory:
+
+.. literalinclude:: ../../examples/cubin_launcher/dynamic_cubin/src/lib_dynamic.cc
+   :language: cpp
+   :start-after: [example.begin]
+   :end-before: [example.end]
+
+Embedding CUBIN with CMake Utilities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TVM-FFI provides CMake utility functions that simplify the CUBIN embedding process. This is the recommended approach for CMake-based projects; an illustrative fragment combining the functions appears at the end of this subsection.
+
+**Using CMake Utilities:**
+
+.. literalinclude:: ../../examples/cubin_launcher/embedded_cubin/CMakeLists.txt
+   :language: cmake
+   :start-after: [cmake_example.begin]
+   :end-before: [cmake_example.end]
+
+**Available CMake Functions:**
+
+- ``tvm_ffi_generate_cubin()``: Compiles CUDA source to CUBIN using nvcc
+
+  - ``OUTPUT``: Path to output CUBIN file
+  - ``SOURCE``: Path to CUDA source file
+  - ``ARCH``: Target GPU architecture (default: ``native`` for auto-detection)
+  - ``OPTIONS``: Additional nvcc compiler options (optional)
+  - ``DEPENDS``: Additional dependencies (optional)
+
+- ``tvm_ffi_embed_cubin()``: Compiles C++ source and embeds CUBIN data
+
+  - ``OUTPUT``: Path to output combined object file
+  - ``SOURCE``: Path to C++ source file with ``TVM_FFI_EMBED_CUBIN`` macro
+  - ``CUBIN``: Path to CUBIN file to embed
+  - ``NAME``: Symbol name used in ``TVM_FFI_EMBED_CUBIN(name)`` macro
+  - ``DEPENDS``: Additional dependencies (optional)
+
+The utilities automatically handle:
+
+- Compiling C++ source to intermediate object file
+- Creating CUBIN symbols with proper naming
+- Merging object files using ``ld -r``
+- Adding ``.note.GNU-stack`` section for security
+- Localizing symbols to prevent conflicts
+
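+For reference, a minimal fragment combining the two functions might look like the following. The paths, names, and target wiring are illustrative; the example ``CMakeLists.txt`` shown above is the authoritative version.
+
+.. code-block:: cmake
+
+   # Compile the CUDA source to a CUBIN for the build machine's GPU
+   # (ARCH defaults to `native` when omitted).
+   tvm_ffi_generate_cubin(
+     OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/kernel.cubin
+     SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/src/kernel.cu
+     ARCH native
+   )
+
+   # Compile the C++ wrapper and embed the CUBIN under the name used in
+   # TVM_FFI_EMBED_CUBIN(my_kernels) inside the wrapper source.
+   tvm_ffi_embed_cubin(
+     OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/lib_embedded_with_cubin.o
+     SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/src/lib_embedded.cc
+     CUBIN ${CMAKE_CURRENT_BINARY_DIR}/kernel.cubin
+     NAME my_kernels
+   )
+
+   # Link the combined object file into the final shared library
+   # (illustrative target name).
+   add_library(my_cubin_lib SHARED ${CMAKE_CURRENT_BINARY_DIR}/lib_embedded_with_cubin.o)
+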
+Embedding CUBIN with Python Utility
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For more advanced use cases or non-CMake build systems, you can use the Python command-line utility to embed CUBIN data into existing object files.
+
+**Command-Line Usage:**
+
+.. code-block:: bash
+
+   # Step 1: Compile C++ source to object file
+   g++ -c -fPIC -std=c++17 -I/path/to/tvm-ffi/include mycode.cc -o mycode.o
+
+   # Step 2: Embed CUBIN into the object file
+   python -m tvm_ffi.utils.embed_cubin \
+       --output-obj mycode_with_cubin.o \
+       --input-obj mycode.o \
+       --cubin kernel.cubin \
+       --name my_kernels
+
+   # Step 3: Link into final library
+   g++ -o mylib.so -shared mycode_with_cubin.o -lcudart
+
+**Python API:**
+
+.. code-block:: python
+
+   from pathlib import Path
+   from tvm_ffi.utils.embed_cubin import embed_cubin
+
+   embed_cubin(
+       cubin_path=Path("kernel.cubin"),
+       input_obj_path=Path("mycode.o"),
+       output_obj_path=Path("mycode_with_cubin.o"),
+       name="my_kernels",
+       verbose=True  # Optional: print detailed progress
+   )
+
+The Python utility performs these steps:
+
+1. Creates intermediate CUBIN object file using ``ld -r -b binary``
+2. Adds ``.note.GNU-stack`` section for security
+3. Renames symbols to match TVM-FFI format (``__tvm_ffi__cubin_<name>``)
+4. Merges with input object file using relocatable linking
+5. Localizes symbols to prevent conflicts when multiple object files use the same name
+
+
+Manual CUBIN Embedding
+~~~~~~~~~~~~~~~~~~~~~~
+
+For reference, here's how to manually embed CUBIN using objcopy and ld:
+
+**Step 1: Compile CUDA kernel to CUBIN**
+
+.. code-block:: bash
+
+   nvcc --cubin -arch=sm_75 kernel.cu -o kernel.cubin
+
+**Step 2: Convert CUBIN to object file**
+
+.. code-block:: bash
+
+   ld -r -b binary -o kernel_data.o kernel.cubin
+
+**Step 3: Rename symbols with objcopy**
+
+.. code-block:: bash
+
+   objcopy --rename-section .data=.rodata,alloc,load,readonly,data,contents \
+           --redefine-sym _binary_kernel_cubin_start=__tvm_ffi__cubin_my_kernels \
+           --redefine-sym _binary_kernel_cubin_end=__tvm_ffi__cubin_my_kernels_end \
+           kernel_data.o
+
+**Step 4: Link with your library**
+
+.. code-block:: bash
+
+   g++ -o mylib.so -shared -fPIC mycode.cc kernel_data.o -Wl,-z,noexecstack -lcuda

Review Comment:
   good catch


