yaoyaoding commented on code in PR #283:
URL: https://github.com/apache/tvm-ffi/pull/283#discussion_r2558615638
########## docs/guides/cubin_launcher.md: ##########
@@ -0,0 +1,471 @@
<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->

<!--- http://www.apache.org/licenses/LICENSE-2.0 -->

<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->

# CUBIN Launcher Guide

This guide demonstrates how to load and launch CUDA kernels from CUBIN (CUDA Binary) modules using TVM-FFI. The CUBIN launcher enables you to execute pre-compiled or runtime-compiled CUDA kernels efficiently through the CUDA Driver API.

## Overview

TVM-FFI provides utilities for loading and launching CUDA kernels from CUBIN modules. The implementation is in `tvm/ffi/extra/cuda/cubin_launcher.h` and provides:

- {cpp:class}`tvm::ffi::CubinModule`: RAII wrapper for loading CUBIN modules from memory
- {cpp:class}`tvm::ffi::CubinKernel`: Handle for launching CUDA kernels with specified parameters
- {c:macro}`TVM_FFI_EMBED_CUBIN`: Macro for embedding CUBIN data at compile time
- {c:macro}`TVM_FFI_EMBED_CUBIN_GET_KERNEL`: Macro for retrieving kernels from embedded CUBIN

The CUBIN launcher supports the following (a minimal usage sketch follows the list):

- Loading CUBIN from memory (embedded data or runtime-generated)
- Multi-GPU execution using CUDA primary contexts
- Kernel parameter management and launch configuration
- Integration with NVRTC, Triton, and other CUDA compilation tools
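To show how these pieces fit together, here is a minimal sketch of direct, non-embedded use of {cpp:class}`tvm::ffi::CubinModule` and {cpp:class}`tvm::ffi::CubinKernel`. The in-memory constructor and the `GetKernel` lookup below are assumptions about the API (this guide only exercises the embedding macros), so consult `tvm/ffi/extra/cuda/cubin_launcher.h` for the actual signatures; the `Launch` call and error-check macro mirror the NVRTC example later in this guide.

```cpp
#include <string>

#include <cuda.h>
#include <tvm/ffi/extra/cuda/cubin_launcher.h>

// Sketch only: constructing a CubinModule from in-memory CUBIN bytes and the
// GetKernel lookup are assumed names, not confirmed API.
void LaunchAddOne(const std::string& cubin_bytes, float* x, float* y, int n,
                  CUstream stream) {
  // CubinModule is an RAII wrapper: the underlying driver module is released
  // when the object goes out of scope. Real code would cache the module and
  // kernel instead of reloading them on every call.
  tvm::ffi::CubinModule mod(cubin_bytes);
  auto kernel = mod.GetKernel("add_one");

  // The driver API takes kernel arguments as an array of pointers to the
  // individual argument values; `int n` matches the kernel's parameter type.
  void* args[] = {&x, &y, &n};
  tvm::ffi::dim3 grid((n + 255) / 256);  // one thread per element
  tvm::ffi::dim3 block(256);

  // Launch on the caller's stream and check the driver result.
  CUresult result = kernel.Launch(args, grid, block, stream);
  TVM_FFI_CHECK_CUDA_DRIVER_ERROR(result);
}
```

In practice, the embedding macros used in the rest of this guide avoid shipping the CUBIN as a separate artifact; the Python workflow below builds on them.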
## Python Usage

### Basic Workflow

The typical workflow for launching CUBIN kernels from Python involves:

1. **Generate CUBIN**: Compile your CUDA kernel to CUBIN format
2. **Define C++ Wrapper**: Write C++ code to load and launch the kernel
3. **Load Module**: Use {py:func}`tvm_ffi.cpp.load_inline` with the `embed_cubin` parameter
4. **Call Kernel**: Invoke the kernel function from Python

### Example: NVRTC Compilation

Here's a complete example using NVRTC to compile CUDA source at runtime:

```python
import torch
from tvm_ffi import cpp
from tvm_ffi.cpp import nvrtc

# Step 1: Define CUDA kernel source
cuda_source = """
extern "C" __global__ void add_one(float* x, float* y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    y[idx] = x[idx] + 1.0f;
  }
}
"""

# Step 2: Compile to CUBIN using NVRTC
cubin_bytes = nvrtc.nvrtc_compile(cuda_source, name="kernel.cu")

# Step 3: Define C++ wrapper with embedded CUBIN
cpp_wrapper = """
#include <tvm/ffi/container/tensor.h>
#include <tvm/ffi/error.h>
#include <tvm/ffi/extra/c_env_api.h>
#include <tvm/ffi/extra/cuda/cubin_launcher.h>
#include <tvm/ffi/function.h>

// Declare embedded CUBIN module
TVM_FFI_EMBED_CUBIN(my_cubin);

void AddOne(tvm::ffi::TensorView x, tvm::ffi::TensorView y) {
  // Get kernel from embedded CUBIN (cached for efficiency)
  static auto kernel = TVM_FFI_EMBED_CUBIN_GET_KERNEL(my_cubin, "add_one");

  // Prepare kernel arguments; use int to match the kernel's `int n` parameter
  int n = static_cast<int>(x.size(0));
  void* x_ptr = x.data_ptr();
  void* y_ptr = y.data_ptr();
  void* args[] = {&x_ptr, &y_ptr, &n};

  // Configure launch parameters
  tvm::ffi::dim3 grid((n + 255) / 256);
  tvm::ffi::dim3 block(256);

  // Get CUDA stream and launch
  DLDevice device = x.device();
  CUstream stream = static_cast<CUstream>(
      TVMFFIEnvGetStream(device.device_type, device.device_id));

  CUresult result = kernel.Launch(args, grid, block, stream);
  TVM_FFI_CHECK_CUDA_DRIVER_ERROR(result);
}

TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, AddOne);
"""

# Step 4: Load module with embedded CUBIN
mod = cpp.load_inline(
    "my_module",
    cuda_sources=cpp_wrapper,
    embed_cubin={"my_cubin": cubin_bytes},
)

# Step 5: Use the kernel
x = torch.arange(1024, dtype=torch.float32, device="cuda")
y = torch.empty_like(x)
mod.add_one(x, y)

# Verify results
assert torch.allclose(y, x + 1)
```

**Key Points:**

- The `embed_cubin` parameter is a dictionary mapping CUBIN names to their binary data; the bytes can come from any compiler, as sketched after this list
- CUBIN names in `embed_cubin` must match the names declared with {c:macro}`TVM_FFI_EMBED_CUBIN`
- Use the `cuda_sources` parameter (instead of `cpp_sources`) to automatically link with the CUDA libraries
- The C++ wrapper handles device management, stream handling, and kernel launching
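Because `embed_cubin` accepts raw bytes, the CUBIN does not have to come from NVRTC. As a hypothetical variation on Step 4 above (the file name and architecture flag are illustrative), a kernel compiled offline with `nvcc --cubin` can be embedded the same way:

```python
# Hypothetical: embed a CUBIN built offline, e.g. with
#   nvcc --cubin -arch=sm_80 -o add_one.cubin add_one.cu
with open("add_one.cubin", "rb") as f:
    cubin_bytes = f.read()

# Same C++ wrapper and CUBIN name as in the NVRTC example above
mod = cpp.load_inline(
    "my_module",
    cuda_sources=cpp_wrapper,
    embed_cubin={"my_cubin": cubin_bytes},
)
```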
### Example: Using Triton Kernels

You can compile Triton kernels to CUBIN and launch them through TVM-FFI:

```python

Review Comment:
Cool, I've updated to use the technique!