yzh119 opened a new issue, #497: URL: https://github.com/apache/tvm-ffi/issues/497
## Feature Request: `cuLaunchKernelEx` support in `CubinKernel::Launch`

### Context

We're migrating FlashInfer's cubin loading infrastructure to use `tvm::ffi::CubinModule` / `CubinKernel` (per the [cubin launcher guide](https://tvm.apache.org/ffi/guides/cubin_launcher.html)). FlashInfer loads pre-compiled cubins at runtime for TRT-LLM attention, GEMM, MoE, and cuDNN SDPA kernels.

### Problem

`CubinKernel::Launch()` currently wraps `cuLaunchKernel` (via `cuda_api::LaunchKernel` in `unified_api.h`), which only supports the basic grid/block/stream/shared-memory parameters:

```cpp
// Current API
cuda_api::ResultType Launch(void** args, dim3 grid, dim3 block,
                            cuda_api::StreamHandle stream,
                            uint32_t dyn_smem_bytes = 0);
```

Several of our kernels require `cuLaunchKernelEx`, which takes a `CUlaunchConfig` struct supporting:

- **Cluster dimensions** (`CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`) — needed for Hopper/Blackwell (SM90+/SM100+) kernels that use thread block clusters
- **Cluster scheduling policy** (`CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE`)
- **Programmatic stream serialization / PDL** (`CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION`)
- **Non-portable cluster sizes** (via `cuFuncSetAttribute` on the kernel before launch)

Example from our TRT-LLM FMHA kernel launcher:

```cpp
CUlaunchConfig launch_config;
launch_config.gridDimX = numCtasX;
launch_config.gridDimY = numCtasY;
launch_config.gridDimZ = numCtasZ;
launch_config.blockDimX = threadsPerCTA;
launch_config.blockDimY = 1;
launch_config.blockDimZ = 1;
launch_config.hStream = stream;
launch_config.sharedMemBytes = sharedMemBytes;

CUlaunchAttribute attrs[3];
attrs[0].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION;
attrs[0].value.clusterDim = {clusterDimX, 1, 1};
attrs[1].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE;
attrs[1].value.clusterSchedulingPolicyPreference = CU_CLUSTER_SCHEDULING_POLICY_SPREAD;
attrs[2].id = CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION;
attrs[2].value.programmaticStreamSerializationAllowed = enable_pdl;

launch_config.attrs = attrs;
launch_config.numAttrs = 3;

cuLaunchKernelEx(&launch_config, func, kernelParamsList, nullptr);
```

Without this, we can use `CubinModule` for loading and `CubinKernel` for kernel retrieval, but we have to drop down to the raw CUDA driver API for the actual launch — which defeats the purpose of having a unified abstraction.

### Proposed API

Option A — extended `Launch` overload:

```cpp
// New overload accepting launch attributes
cuda_api::ResultType Launch(void** args, dim3 grid, dim3 block,
                            cuda_api::StreamHandle stream,
                            uint32_t dyn_smem_bytes,
                            cuda_api::LaunchAttrType* attrs, int num_attrs);
```

Option B — accept a `CUlaunchConfig` / `cudaLaunchConfig_t` directly:

```cpp
// Pass the full launch config (already defined as cuda_api::LaunchConfig)
cuda_api::ResultType LaunchEx(cuda_api::LaunchConfig* config, void** args);
```

Option B is simpler and forward-compatible with future launch attributes.

### Impact

This would let us fully adopt `CubinModule`/`CubinKernel` for all our cubin-based kernels. We currently have ~6 files that load cubins, covering the attention, GEMM, MoE, and cuDNN backends — all of them require cluster launch support on SM90+/SM100+.

### Workaround

We can extract the raw handle via `CubinKernel::GetHandle()` and cast it for `cuLaunchKernelEx`, but this breaks the abstraction and ties us to driver API internals:

```cpp
// Works but fragile
cuLaunchKernelEx(&config, reinterpret_cast<CUfunction>(kernel.GetHandle()), ...);
```
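To illustrate the call shape Option B would give callers, here is a minimal sketch of building one config object and handing it to a single `LaunchEx` call. It uses stand-in structs instead of the real CUDA driver types so it compiles without the toolkit; the struct names, the `MakeClusterConfig` helper, and the attribute id value are all hypothetical, chosen only to mirror the fields of `CUlaunchConfig`:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for CUlaunchAttribute / CUlaunchConfig,
// used only to sketch how an Option B call site would look.
struct FakeLaunchAttr {
  int id;            // stand-in for a CUlaunchAttributeID value
  uint32_t x, y, z;  // attribute payload (here: cluster dimensions)
};

struct FakeLaunchConfig {
  uint32_t gridX = 1, gridY = 1, gridZ = 1;
  uint32_t blockX = 1, blockY = 1, blockZ = 1;
  uint32_t sharedMemBytes = 0;
  // The real API uses a raw attrs pointer plus numAttrs; a vector
  // keeps this sketch self-contained.
  std::vector<FakeLaunchAttr> attrs;
};

// Build a config for a 1-D grid with a cluster dimension, mirroring the
// TRT-LLM FMHA launcher shown earlier.
FakeLaunchConfig MakeClusterConfig(uint32_t num_ctas, uint32_t threads_per_cta,
                                   uint32_t smem_bytes, uint32_t cluster_x) {
  FakeLaunchConfig cfg;
  cfg.gridX = num_ctas;
  cfg.blockX = threads_per_cta;
  cfg.sharedMemBytes = smem_bytes;
  // id 4 is a placeholder standing in for CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION.
  cfg.attrs.push_back({4, cluster_x, 1, 1});
  return cfg;
}

// With Option B, launching would then be one call against the wrapper:
//   kernel.LaunchEx(&cfg, kernel_args);   // hypothetical CubinKernel API
```

The appeal of this shape is that every future `CUlaunchAttribute` flows through unchanged: the wrapper forwards the config to `cuLaunchKernelEx` and never needs a new overload per attribute.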
