tqchen commented on code in PR #283:
URL: https://github.com/apache/tvm-ffi/pull/283#discussion_r2558392383


##########
docs/guides/cubin_launcher.md:
##########
@@ -0,0 +1,471 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements.  See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership.  The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License.  You may obtain a copy of the License at -->
+
+<!---   http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied.  See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+
+# CUBIN Launcher Guide
+
+This guide demonstrates how to load and launch CUDA kernels from CUBIN (CUDA Binary) modules using TVM-FFI. The CUBIN launcher enables you to execute pre-compiled or runtime-compiled CUDA kernels efficiently through the CUDA Driver API.
+
+## Overview
+
+TVM-FFI provides utilities for loading and launching CUDA kernels from CUBIN modules. The implementation is in `tvm/ffi/extra/cuda/cubin_launcher.h` and provides:
+
+- {cpp:class}`tvm::ffi::CubinModule`: RAII wrapper for loading CUBIN modules from memory
+- {cpp:class}`tvm::ffi::CubinKernel`: Handle for launching CUDA kernels with specified parameters
+- {c:macro}`TVM_FFI_EMBED_CUBIN`: Macro for embedding CUBIN data at compile time
+- {c:macro}`TVM_FFI_EMBED_CUBIN_GET_KERNEL`: Macro for retrieving kernels from embedded CUBIN
+
+The CUBIN launcher supports:
+
+- Loading CUBIN from memory (embedded data or runtime-generated)
+- Multi-GPU execution using CUDA primary contexts
+- Kernel parameter management and launch configuration
+- Integration with NVRTC, Triton, and other CUDA compilation tools
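+
+As a quick preview, here is the core pattern condensed from the full NVRTC example below (a sketch, not a complete listing): declare an embedded CUBIN, fetch a kernel from it once, and launch it with explicit grid/block dimensions.
+
+```cpp
+#include <tvm/ffi/extra/cuda/cubin_launcher.h>
+
+// Declare an embedded CUBIN named `my_cubin` (supplied at build time).
+TVM_FFI_EMBED_CUBIN(my_cubin);
+
+void LaunchAddOne(void* x_ptr, void* y_ptr, int n, CUstream stream) {
+  // The kernel lookup is cached in a function-local static.
+  static auto kernel = TVM_FFI_EMBED_CUBIN_GET_KERNEL(my_cubin, "add_one");
+  void* args[] = {&x_ptr, &y_ptr, &n};
+  tvm::ffi::dim3 grid((n + 255) / 256);
+  tvm::ffi::dim3 block(256);
+  CUresult result = kernel.Launch(args, grid, block, stream);
+  TVM_FFI_CHECK_CUDA_DRIVER_ERROR(result);
+}
+```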
+
+## Python Usage
+
+### Basic Workflow
+
+The typical workflow for launching CUBIN kernels from Python involves:
+
+1. **Generate CUBIN**: Compile your CUDA kernel to CUBIN format
+2. **Define C++ Wrapper**: Write C++ code to load and launch the kernel
+3. **Load Module**: Use {py:func}`tvm_ffi.cpp.load_inline` with the `embed_cubin` parameter
+4. **Call Kernel**: Invoke the kernel function from Python
+
+### Example: NVRTC Compilation
+
+Here's a complete example using NVRTC to compile CUDA source at runtime:
+
+```python
+import torch
+from tvm_ffi import cpp
+from tvm_ffi.cpp import nvrtc
+
+# Step 1: Define CUDA kernel source
+cuda_source = """
+extern "C" __global__ void add_one(float* x, float* y, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n) {
+        y[idx] = x[idx] + 1.0f;
+    }
+}
+"""
+
+# Step 2: Compile to CUBIN using NVRTC
+cubin_bytes = nvrtc.nvrtc_compile(cuda_source, name="kernel.cu")
+
+# Step 3: Define C++ wrapper with embedded CUBIN
+cpp_wrapper = """
+#include <tvm/ffi/container/tensor.h>
+#include <tvm/ffi/error.h>
+#include <tvm/ffi/extra/c_env_api.h>
+#include <tvm/ffi/extra/cuda/cubin_launcher.h>
+#include <tvm/ffi/function.h>
+
+// Declare embedded CUBIN module
+TVM_FFI_EMBED_CUBIN(my_cubin);
+
+void AddOne(tvm::ffi::TensorView x, tvm::ffi::TensorView y) {
+  // Get kernel from embedded CUBIN (cached for efficiency)
+  static auto kernel = TVM_FFI_EMBED_CUBIN_GET_KERNEL(my_cubin, "add_one");
+
+  // Prepare kernel arguments
+  int n = static_cast<int>(x.size(0));  // the kernel declares `int n`, so pass a 32-bit value
+  void* x_ptr = x.data_ptr();
+  void* y_ptr = y.data_ptr();
+  void* args[] = {&x_ptr, &y_ptr, &n};
+
+  // Configure launch parameters
+  tvm::ffi::dim3 grid((n + 255) / 256);
+  tvm::ffi::dim3 block(256);
+
+  // Get CUDA stream and launch
+  DLDevice device = x.device();
+  CUstream stream = static_cast<CUstream>(
+      TVMFFIEnvGetStream(device.device_type, device.device_id));
+
+  CUresult result = kernel.Launch(args, grid, block, stream);
+  TVM_FFI_CHECK_CUDA_DRIVER_ERROR(result);
+}
+
+TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, AddOne);
+"""
+
+# Step 4: Load module with embedded CUBIN
+mod = cpp.load_inline(
+    "my_module",
+    cuda_sources=cpp_wrapper,
+    embed_cubin={"my_cubin": cubin_bytes}
+)
+
+# Step 5: Use the kernel
+x = torch.arange(1024, dtype=torch.float32, device="cuda")
+y = torch.empty_like(x)
+mod.add_one(x, y)
+
+# Verify results
+assert torch.allclose(y, x + 1)
+```
+
+**Key Points:**
+
+- The `embed_cubin` parameter is a dictionary mapping CUBIN names to their binary data (see the sketch after this list)
+- CUBIN names in `embed_cubin` must match the names declared with {c:macro}`TVM_FFI_EMBED_CUBIN`
+- Use the `cuda_sources` parameter (instead of `cpp_sources`) to automatically link with the CUDA libraries
+- The C++ wrapper handles device management, stream handling, and kernel launching
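+
+The same workflow also covers CUBINs compiled ahead of time. A minimal sketch (the `kernel.cubin` path and the `nvcc` invocation are illustrative, not part of the example above):
+
+```python
+# Sketch: embed a pre-compiled CUBIN instead of an NVRTC-compiled one.
+# Assumes the kernel was built offline, e.g.:
+#   nvcc --cubin -arch=sm_80 kernel.cu -o kernel.cubin
+with open("kernel.cubin", "rb") as f:
+    cubin_bytes = f.read()
+
+mod = cpp.load_inline(
+    "my_module",
+    cuda_sources=cpp_wrapper,               # same C++ wrapper as above
+    embed_cubin={"my_cubin": cubin_bytes},  # name must match TVM_FFI_EMBED_CUBIN(my_cubin)
+)
+```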
+
+### Example: Using Triton Kernels
+
+You can compile Triton kernels to CUBIN and launch them through TVM-FFI:
+
+```python

Review Comment:
   I remember @junrushao mentioned there is a way for us to refer to an existing file in the repo from rst so we don't have to copy code; check out the quick start.
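
   For illustration (the path and the begin/end markers here are hypothetical), Sphinx's `literalinclude` directive can pull a snippet straight from a file in the repo instead of copying it; in a MyST markdown page such as this guide it would look roughly like:

       ```{literalinclude} ../../examples/cubin_launcher/example.py
       :language: python
       :start-after: [example-begin]
       :end-before: [example-end]
       ```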


