yzh119 opened a new issue, #497: URL: https://github.com/apache/tvm-ffi/issues/497
## Feature Request: `cuLaunchKernelEx` support in `CubinKernel::Launch`

### Context

We're migrating FlashInfer's cubin loading infrastructure to use `tvm::ffi::CubinModule` / `CubinKernel` (per the [cubin launcher guide](https://tvm.apache.org/ffi/guides/cubin_launcher.html)). FlashInfer loads pre-compiled cubins at runtime for TRT-LLM attention, GEMM, MoE, and cuDNN SDPA kernels.

### Problem

`CubinKernel::Launch()` currently wraps `cuLaunchKernel` (via `cuda_api::LaunchKernel` in `unified_api.h`), which only supports the basic grid/block/stream/shared-memory parameters:

```cpp
// Current API
cuda_api::ResultType Launch(void** args, dim3 grid, dim3 block,
                            cuda_api::StreamHandle stream,
                            uint32_t dyn_smem_bytes = 0);
```

Several of our kernels require `cuLaunchKernelEx`, which takes a `CUlaunchConfig` struct supporting:

- **Cluster dimensions** (`CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`) — needed for Hopper/Blackwell (SM90+/SM100+) kernels that use thread block clusters
- **Cluster scheduling policy** (`CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE`)
- **Programmatic stream serialization / PDL** (`CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION`)
- **Non-portable cluster sizes** (via `cuFuncSetAttribute` on the kernel before launch)

Example from our TRT-LLM FMHA kernel launcher:

```cpp
CUlaunchConfig launch_config;
launch_config.gridDimX = numCtasX;
launch_config.gridDimY = numCtasY;
launch_config.gridDimZ = numCtasZ;
launch_config.blockDimX = threadsPerCTA;
launch_config.blockDimY = 1;
launch_config.blockDimZ = 1;
launch_config.hStream = stream;
launch_config.sharedMemBytes = sharedMemBytes;

CUlaunchAttribute attrs[3];
attrs[0].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION;
attrs[0].value.clusterDim = {clusterDimX, 1, 1};
attrs[1].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE;
attrs[1].value.clusterSchedulingPolicyPreference = CU_CLUSTER_SCHEDULING_POLICY_SPREAD;
attrs[2].id = CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION;
attrs[2].value.programmaticStreamSerializationAllowed = enable_pdl;

launch_config.attrs = attrs;
launch_config.numAttrs = 3;

cuLaunchKernelEx(&launch_config, func, kernelParamsList, nullptr);
```

Without this, we can use `CubinModule` for loading and `CubinKernel` for kernel retrieval, but we have to drop down to the raw CUDA driver API for the actual launch — which defeats the purpose of having a unified abstraction.

### Proposed API

Option A — extended `Launch` overload:

```cpp
// New overload accepting launch attributes
cuda_api::ResultType Launch(void** args, dim3 grid, dim3 block,
                            cuda_api::StreamHandle stream,
                            uint32_t dyn_smem_bytes,
                            cuda_api::LaunchAttrType* attrs, int num_attrs);
```

Option B — accept a `CUlaunchConfig` / `cudaLaunchConfig_t` directly:

```cpp
// Pass the full launch config (already defined as cuda_api::LaunchConfig)
cuda_api::ResultType LaunchEx(cuda_api::LaunchConfig* config, void** args);
```

Option B is simpler and forward-compatible with future launch attributes.

### Impact

This would let us fully adopt `CubinModule`/`CubinKernel` for all our cubin-based kernels. We currently have ~6 files that load cubins, covering the attention, GEMM, MoE, and cuDNN backends — all of them require cluster launch support on SM90+/SM100+.

### Workaround

We can extract the raw handle via `CubinKernel::GetHandle()` and cast it for `cuLaunchKernelEx`, but this breaks the abstraction and ties us to driver API internals:

```cpp
// Works but fragile
cuLaunchKernelEx(&config, reinterpret_cast<CUfunction>(kernel.GetHandle()), ...);
```
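To illustrate the call shape Option B would give callers, here is a minimal sketch of building one config object and handing it to a single `LaunchEx` call. It uses stand-in structs instead of the real CUDA driver types so it compiles without the toolkit; the struct names, the `MakeClusterConfig` helper, and the attribute id value are all hypothetical, chosen only to mirror the fields of `CUlaunchConfig`:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for CUlaunchAttribute / CUlaunchConfig,
// used only to sketch how an Option B call site would look.
struct FakeLaunchAttr {
  int id;            // stand-in for a CUlaunchAttributeID value
  uint32_t x, y, z;  // attribute payload (here: cluster dimensions)
};

struct FakeLaunchConfig {
  uint32_t gridX = 1, gridY = 1, gridZ = 1;
  uint32_t blockX = 1, blockY = 1, blockZ = 1;
  uint32_t sharedMemBytes = 0;
  // The real API uses a raw attrs pointer plus numAttrs; a vector
  // keeps this sketch self-contained.
  std::vector<FakeLaunchAttr> attrs;
};

// Build a config for a 1-D grid with a cluster dimension, mirroring the
// TRT-LLM FMHA launcher shown earlier.
FakeLaunchConfig MakeClusterConfig(uint32_t num_ctas, uint32_t threads_per_cta,
                                   uint32_t smem_bytes, uint32_t cluster_x) {
  FakeLaunchConfig cfg;
  cfg.gridX = num_ctas;
  cfg.blockX = threads_per_cta;
  cfg.sharedMemBytes = smem_bytes;
  // id 4 is a placeholder standing in for CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION.
  cfg.attrs.push_back({4, cluster_x, 1, 1});
  return cfg;
}

// With Option B, launching would then be one call against the wrapper:
//   kernel.LaunchEx(&cfg, kernel_args);   // hypothetical CubinKernel API
```

The appeal of this shape is that every future `CUlaunchAttribute` flows through unchanged: the wrapper forwards the config to `cuLaunchKernelEx` and never needs a new overload per attribute.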
