This is an automated email from the ASF dual-hosted git repository.

spectrometerHBH pushed a commit to branch tir-bench
in repository https://gitbox.apache.org/repos/asf/tvm.git

commit a394fd58b073399eb75be81fe31201cf5b9247af
Author: Bohan Hou <[email protected]>
AuthorDate: Sun May 24 10:43:24 2026 -0700

    docs: update tir bench baseline results (#642)
---
 .claude/commands/tir-bench.md | 317 ++++++++++++++++++++++--------------------
 1 file changed, 167 insertions(+), 150 deletions(-)

diff --git a/.claude/commands/tir-bench.md b/.claude/commands/tir-bench.md
index 515863829b..06dbb6f680 100644
--- a/.claude/commands/tir-bench.md
+++ b/.claude/commands/tir-bench.md
@@ -5,14 +5,14 @@ Run kernel performance benchmarks to verify codegen changes.
 All commands use `--warmup 100 --repeat 30` for ~3-minute total runtime with reliable medians. Drop to defaults only when chasing a sub-2% regression.
 
 - **GEMM**: square GEMM at M=N=K in {1024, 2048, 4096, 8192, 16384} for three variants:
-  - fp16: `python -m tirx_kernels.bench --kernel fp16_bf16_gemm --warmup 100 --repeat 30`
-  - fp8: `python -m tirx_kernels.bench --kernel fp8_blockwise_gemm --warmup 100 --repeat 30`
-  - nvfp4: `python -m tirx_kernels.bench --kernel nvfp4_gemm --warmup 100 --repeat 30`
+  - fp16: `python -m tirx_kernels.bench --kernel fp16_bf16_gemm --warmup 100 --repeat 30 --timer proton`
+  - fp8: `python -m tirx_kernels.bench --kernel fp8_blockwise_gemm --warmup 100 --repeat 30 --timer proton`
+  - nvfp4: `python -m tirx_kernels.bench --kernel nvfp4_gemm --warmup 100 --repeat 30 --timer proton`
 - **FA4** (flash_attention4): all registered configs
-  - `python -m tirx_kernels.bench --kernel flash_attention4 --warmup 100 --repeat 30`
+  - `python -m tirx_kernels.bench --kernel flash_attention4 --warmup 100 --repeat 30 --timer proton`
 - **MQA logits** (fp8 / fp4): all registered configs
-  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp8_mqa_logits --warmup 100 --repeat 30`
-  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp4_mqa_logits --warmup 100 --repeat 30`
+  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp8_mqa_logits --warmup 100 --repeat 30 --timer proton`
+  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp4_mqa_logits --warmup 100 --repeat 30 --timer proton`
 
 ## Steps
 
@@ -21,9 +21,15 @@ All commands use `--warmup 100 --repeat 30` for ~3-minute total runtime with rel
    export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | sort -t',' -k2 -n | head -1 | cut -d',' -f1 | tr -d ' ')
    ```
 
-2. Run benchmarks for each kernel using the commands above.
+2. Record the exact provenance for every implementation in the result table:
+   - `tir`: `git rev-parse HEAD` from the TIR checkout being tested.
+   - `tirx-kernels`: `git rev-parse HEAD` from the kernel checkout used by `python -m tirx_kernels.bench`.
+   - Each git-backed baseline implementation repo, such as DeepGEMM, FlashInfer, or FlashAttention: repo path and full commit SHA.
+   - Package/system baselines without a local git repo, such as `torch-cublas`: package version, package git version when available, CUDA version, and any library version that is easy to query.
 
-3. Present results in a table: kernel x config, with times in ms.
+3. Run benchmarks for each kernel using the commands above.
+
+4. Present results in a table: kernel x config, with times in us, followed by the provenance block from step 2.
 
 ## When to use
 
@@ -31,165 +37,176 @@ When modifying anything that affects code generation: kernels, op dispatches, lo
 
 ## Reference baseline
 
-Captured 2026-05-17 on B200 (sm_100a), GPU 7, `warmup=100 repeat=30`, `timer=proton`.
+Captured 2026-05-24 on B200 (sm_100a), physical GPU 2, `warmup=100 repeat=30`, `timer=proton`.
 
-- `tir`             @ `587f439c4c` (branch `scope-id`, with `feat(exec-scope): infer scope_id extent from sibling defs when omitted` on top of upstream tirx `c9ee147baf`)
-- `tirx-kernels`    @ `fdab8ac5` (branch `scope-id`, with `perf(kernel): hoist mqa_fp8 warpgroup index` on top of upstream `ae8673c9`)
+Rows that were below `0.95x` in the full run were rerun twice; those rows use the median over the full-run measurement plus the two reruns (`runs=3`). Other rows use the full-run measurement (`runs=1`). All times are in us. `baseline/tirx > 1` means TIRX is faster.
 
-All times in us. `baseline/tirx` > 1 means TIRX faster.
+- raw full-run results: `/home/bohanhou/tirx-kernels/.porting/tir_bench_full/20260524T165849Z_gpu2_w100_r30`
+- low-ratio reruns: `/home/bohanhou/tirx-kernels/.porting/tir_bench_full/20260524T165849Z_gpu2_w100_r30/reruns_lt095`
 
-### `fp16_bf16_gemm` (baseline=`torch-cublas`)
+Implementation provenance:
 
+- `tir`             : `/home/bohanhou/tir` @ `2a3241a267003249bdf45555a5c6bbcc2a03c90b` (commit date `2026-05-17T00:46:03-04:00`, `feat(op): add bounded mbarrier wait (#627)`, dirty local worktree)
+- `tirx-kernels`    : `/home/bohanhou/tirx-kernels` @ `127cd12de4e4962d4499a39d37beb8a6d4105306` (commit date `2026-05-17T00:44:53-04:00`, `feat(nymph): add experimental lowering stack and acceptance gates (#290)`, dirty local worktree)
 
-| config | torch-cublas | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `fp16_1024x1024x1024` | 5.73us | 16.54us | 0.347 |
-| `fp16_2048x2048x2048` | 16.40us | 27.91us | 0.588 |
-| `fp16_4096x4096x4096` | 95.19us | 94.34us | 1.009 |
-| `fp16_8192x8192x8192` | 823.15us | 843.04us | 0.976 |
-| `fp16_16384x16384x16384` | 6093.33us | 6128.95us | 0.994 |
-| `bf16_1024x1024x1024` | 5.72us | 16.51us | 0.347 |
-| `bf16_2048x2048x2048` | 16.13us | 27.77us | 0.581 |
-| `bf16_4096x4096x4096` | 92.25us | 91.35us | 1.010 |
-| `bf16_8192x8192x8192` | 756.17us | 781.91us | 0.967 |
-| `bf16_16384x16384x16384` | 5823.27us | 5809.98us | 1.002 |
+Baseline implementation provenance:
 
-### `fp8_blockwise_gemm` (baseline=`deepgemm`)
+- `torch-cublas`    : PyTorch `2.11.0+cu130`, `torch.version.git_version=70d99e998b4955e0049d13a98d77ae1b14db1f45`, `torch.version.cuda=13.0`
+- `deepgemm`        : `/home/bohanhou/DeepGEMM` @ `714dd1a4a980f7937a74343d19a8eba4fe321480` (commit date `2026-05-11T19:20:18+08:00`, `Update test_mega_moe.py`)
+- `flashinfer`      : `/home/bohanhou/flashinfer` @ `bff85f3459707d5d2f1426d1ded4a320ab142078` (commit date `2026-05-22T14:36:34-07:00`, `feat: integrate cute-dsl Blackwell GQA decode into BatchDecodeWithPagedKVCacheWrapper (#3360)`, `flashinfer.__version__=0.6.11.post1`)
+- `flashattn_sm100` : `/home/bohanhou/flash-attention` @ `3da76cdb8aedd842c46511e5194f5f20cdd4cf6f` (commit date `2026-05-22T16:00:00-07:00`, `Build Fix: Update abi3 tag to cp310 and minimum python version to 3.10 (#2532)`)
 
+Notes:
 
-| config | deepgemm | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `smoke_1024x1024x1024` | 6.07us | 5.91us | 1.026 |
-| `deepgemm_m4096_n2112_k7168` | 49.86us | 48.96us | 1.018 |
-| `deepgemm_m4096_n576_k7168` | 19.12us | 18.84us | 1.015 |
-| `deepgemm_m4096_n24576_k1536` | 116.18us | 115.68us | 1.004 |
-| `deepgemm_m4096_n32768_k512` | 75.54us | 71.28us | 1.060 |
-| `deepgemm_m4096_n7168_k16384` | 320.22us | 329.80us | 0.971 |
-| `deepgemm_m4096_n4096_k7168` | 83.19us | 82.69us | 1.006 |
-| `deepgemm_m4096_n7168_k2048` | 44.04us | 43.59us | 1.010 |
-| `stress_m8192_n7168_k4096` | 159.30us | 159.99us | 0.996 |
+- `fp16_bf16_gemm` still reports `BASELINE_ERROR: triton: No module named 'tirx_kernels.gemm._triton_matmul'`; the recorded baseline is `torch-cublas`.
+- `nvfp4_gemm` FlashInfer baseline uses `backend="cutlass"`, `use_nvfp4=True`, and `flashinfer.autotune(True)`. `backend="auto"` was not recorded because it fails in this environment with mixed CUDA runtime libraries (`libcudart.so.12` and `libcudart.so.13`).
 
-### `nvfp4_gemm` (baseline=`flashinfer`)
+### `fp16_bf16_gemm` (baseline=`torch-cublas`)
 
+| config | torch-cublas | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `fp16_1024x1024x1024` | 5.47us | 16.46us | 0.332 | 3 |
+| `fp16_2048x2048x2048` | 18.57us | 27.22us | 0.682 | 3 |
+| `fp16_4096x4096x4096` | 156.08us | 95.04us | 1.642 | 1 |
+| `fp16_8192x8192x8192` | 894.20us | 855.20us | 1.046 | 3 |
+| `fp16_16384x16384x16384` | 6451.67us | 6740.25us | 0.957 | 3 |
+| `bf16_1024x1024x1024` | 5.48us | 16.45us | 0.333 | 3 |
+| `bf16_2048x2048x2048` | 18.36us | 27.08us | 0.678 | 3 |
+| `bf16_4096x4096x4096` | 92.75us | 91.47us | 1.014 | 1 |
+| `bf16_8192x8192x8192` | 871.44us | 780.62us | 1.116 | 1 |
+| `bf16_16384x16384x16384` | 6043.31us | 6270.93us | 0.964 | 1 |
 
-| config | flashinfer | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `1024x1024x1024` | 5.13us | 6.59us | 0.778 |
-| `2048x2048x2048` | 8.39us | 8.84us | 0.950 |
-| `4096x4096x4096` | 32.50us | 30.56us | 1.064 |
-| `8192x8192x8192` | 199.24us | 186.39us | 1.069 |
-| `16384x16384x16384` | 2128.05us | 1511.81us | 1.408 |
+### `fp8_blockwise_gemm` (baseline=`deepgemm`)
 
-### `flash_attention4` (baseline=`flashattn_sm100`)
+| config | deepgemm | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `smoke_1024x1024x1024` | 6.84us | 6.36us | 1.077 | 1 |
+| `deepgemm_m4096_n2112_k7168` | 49.78us | 48.67us | 1.023 | 1 |
+| `deepgemm_m4096_n576_k7168` | 19.44us | 18.87us | 1.030 | 1 |
+| `deepgemm_m4096_n24576_k1536` | 117.35us | 115.60us | 1.015 | 1 |
+| `deepgemm_m4096_n32768_k512` | 75.29us | 71.92us | 1.047 | 1 |
+| `deepgemm_m4096_n7168_k16384` | 328.45us | 314.40us | 1.045 | 1 |
+| `deepgemm_m4096_n4096_k7168` | 83.71us | 83.71us | 1.000 | 1 |
+| `deepgemm_m4096_n7168_k2048` | 44.64us | 43.97us | 1.015 | 1 |
+| `stress_m8192_n7168_k4096` | 161.47us | 161.74us | 0.998 | 1 |
 
+### `nvfp4_gemm` (baseline=`flashinfer`)
 
-| config | flashattn_sm100 | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `s1024_h32kv4` | 20.34us | 20.80us | 0.978 |
-| `s1024_h32kv4_causal` | 19.85us | 19.66us | 1.009 |
-| `s1024_h32kv8` | 20.50us | 20.91us | 0.980 |
-| `s1024_h32kv8_causal` | 19.85us | 19.75us | 1.005 |
-| `s1024_h32kv16` | 20.51us | 21.05us | 0.974 |
-| `s1024_h32kv16_causal` | 20.24us | 20.68us | 0.979 |
-| `s1024_h32kv32` | 20.75us | 21.18us | 0.980 |
-| `s1024_h32kv32_causal` | 21.07us | 22.24us | 0.947 |
-| `s2048_h32kv4` | 59.47us | 60.85us | 0.977 |
-| `s2048_h32kv4_causal` | 39.40us | 37.51us | 1.050 |
-| `s2048_h32kv8` | 60.23us | 61.84us | 0.974 |
-| `s2048_h32kv8_causal` | 39.49us | 37.76us | 1.046 |
-| `s2048_h32kv16` | 60.60us | 62.83us | 0.965 |
-| `s2048_h32kv16_causal` | 39.94us | 38.57us | 1.036 |
-| `s2048_h32kv32` | 61.59us | 63.62us | 0.968 |
-| `s2048_h32kv32_causal` | 40.29us | 42.38us | 0.951 |
-| `s4096_h32kv4` | 203.59us | 204.89us | 0.994 |
-| `s4096_h32kv4_causal` | 114.98us | 111.69us | 1.029 |
-| `s4096_h32kv8` | 204.46us | 207.67us | 0.985 |
-| `s4096_h32kv8_causal` | 116.24us | 112.45us | 1.034 |
-| `s4096_h32kv16` | 208.31us | 211.63us | 0.984 |
-| `s4096_h32kv16_causal` | 117.59us | 113.66us | 1.035 |
-| `s4096_h32kv32` | 211.75us | 216.02us | 0.980 |
-| `s4096_h32kv32_causal` | 118.98us | 122.09us | 0.975 |
-| `s8192_h32kv4` | 816.39us | 818.33us | 0.998 |
-| `s8192_h32kv4_causal` | 429.56us | 420.64us | 1.021 |
-| `s8192_h32kv8` | 795.55us | 852.89us | 0.933 |
-| `s8192_h32kv8_causal` | 411.97us | 440.47us | 0.935 |
-| `s8192_h32kv16` | 779.83us | 841.29us | 0.927 |
-| `s8192_h32kv16_causal` | 412.70us | 399.01us | 1.034 |
-| `s8192_h32kv32` | 784.06us | 821.54us | 0.954 |
-| `s8192_h32kv32_causal` | 459.55us | 420.57us | 1.093 |
+| config | flashinfer | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `1024x1024x1024` | 5.19us | 6.74us | 0.770 | 3 |
+| `2048x2048x2048` | 8.51us | 8.87us | 0.960 | 3 |
+| `4096x4096x4096` | 30.96us | 30.22us | 1.025 | 1 |
+| `8192x8192x8192` | 176.76us | 187.81us | 0.941 | 3 |
+| `16384x16384x16384` | 1673.05us | 1546.78us | 1.082 | 1 |
 
-### `deepgemm_sm100_fp8_mqa_logits` (baseline=`deepgemm`)
+### `flash_attention4` (baseline=`flashattn_sm100`)
 
+| config | flashattn_sm100 | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `s1024_h32kv4` | 20.26us | 20.59us | 0.984 | 1 |
+| `s1024_h32kv4_causal` | 19.30us | 19.16us | 1.007 | 1 |
+| `s1024_h32kv8` | 20.06us | 20.74us | 0.967 | 1 |
+| `s1024_h32kv8_causal` | 19.43us | 19.35us | 1.004 | 1 |
+| `s1024_h32kv16` | 20.34us | 20.91us | 0.973 | 1 |
+| `s1024_h32kv16_causal` | 19.82us | 20.16us | 0.983 | 1 |
+| `s1024_h32kv32` | 20.59us | 21.26us | 0.968 | 1 |
+| `s1024_h32kv32_causal` | 20.50us | 22.11us | 0.927 | 3 |
+| `s2048_h32kv4` | 59.33us | 60.65us | 0.978 | 1 |
+| `s2048_h32kv4_causal` | 38.53us | 36.74us | 1.049 | 1 |
+| `s2048_h32kv8` | 59.55us | 60.89us | 0.978 | 1 |
+| `s2048_h32kv8_causal` | 38.74us | 37.10us | 1.044 | 1 |
+| `s2048_h32kv16` | 60.52us | 62.69us | 0.966 | 1 |
+| `s2048_h32kv16_causal` | 39.33us | 37.86us | 1.039 | 1 |
+| `s2048_h32kv32` | 61.18us | 63.07us | 0.970 | 1 |
+| `s2048_h32kv32_causal` | 40.00us | 41.94us | 0.954 | 1 |
+| `s4096_h32kv4` | 203.30us | 203.98us | 0.997 | 1 |
+| `s4096_h32kv4_causal` | 114.28us | 110.97us | 1.030 | 1 |
+| `s4096_h32kv8` | 204.64us | 212.61us | 0.963 | 1 |
+| `s4096_h32kv8_causal` | 115.19us | 111.56us | 1.032 | 1 |
+| `s4096_h32kv16` | 208.55us | 215.10us | 0.970 | 1 |
+| `s4096_h32kv16_causal` | 116.25us | 113.04us | 1.028 | 1 |
+| `s4096_h32kv32` | 213.89us | 217.60us | 0.983 | 1 |
+| `s4096_h32kv32_causal` | 118.53us | 123.54us | 0.959 | 1 |
+| `s8192_h32kv4` | 850.39us | 837.22us | 1.016 | 1 |
+| `s8192_h32kv4_causal` | 462.08us | 445.77us | 1.037 | 1 |
+| `s8192_h32kv8` | 863.43us | 850.11us | 1.016 | 1 |
+| `s8192_h32kv8_causal` | 427.58us | 399.58us | 1.070 | 1 |
+| `s8192_h32kv16` | 859.06us | 763.47us | 1.125 | 1 |
+| `s8192_h32kv16_causal` | 415.83us | 405.42us | 1.026 | 1 |
+| `s8192_h32kv32` | 833.86us | 873.28us | 0.955 | 1 |
+| `s8192_h32kv32_causal` | 441.40us | 455.03us | 0.970 | 1 |
 
-| config | deepgemm | tirx | baseline/tirx |
-|---|---:|---:|---:|
-| `s2048_skv4096_h64_d128_f32_dense_cp` | 43.80us | 44.49us | 0.984 |
-| `s2048_skv4096_h64_d128_f32_dense_nocp` | 58.50us | 58.59us | 0.999 |
-| `s2048_skv8192_h64_d128_f32_dense_cp` | 77.25us | 78.07us | 0.990 |
-| `s2048_skv8192_h64_d128_f32_dense_nocp` | 118.40us | 118.97us | 0.995 |
-| `s4096_skv4096_h64_d128_f32_dense_cp` | 78.02us | 77.94us | 1.001 |
-| `s4096_skv4096_h64_d128_f32_dense_nocp` | 77.89us | 78.37us | 0.994 |
-| `s4096_skv8192_h64_d128_f32_dense_cp` | 136.98us | 136.12us | 1.006 |
-| `s4096_skv8192_h64_d128_f32_dense_nocp` | 196.36us | 202.57us | 0.969 |
-| `s2048_skv4096_h64_d128_f32_compressed_cp` | 46.60us | 44.88us | 1.038 |
-| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 61.46us | 59.54us | 1.032 |
-| `s2048_skv8192_h64_d128_f32_compressed_cp` | 81.83us | 78.99us | 1.036 |
-| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 125.40us | 120.15us | 1.044 |
-| `s4096_skv4096_h64_d128_f32_compressed_cp` | 83.89us | 78.42us | 1.070 |
-| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 83.94us | 78.89us | 1.064 |
-| `s4096_skv8192_h64_d128_f32_compressed_cp` | 147.25us | 137.97us | 1.067 |
-| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 209.79us | 196.89us | 1.066 |
-| `s2048_skv4096_h64_d128_bf16_dense_cp` | 44.73us | 44.81us | 0.998 |
-| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 58.90us | 59.29us | 0.993 |
-| `s2048_skv8192_h64_d128_bf16_dense_cp` | 79.48us | 79.03us | 1.006 |
-| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 121.27us | 121.16us | 1.001 |
-| `s4096_skv4096_h64_d128_bf16_dense_cp` | 78.87us | 78.84us | 1.000 |
-| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 79.02us | 78.66us | 1.005 |
-| `s4096_skv8192_h64_d128_bf16_dense_cp` | 139.18us | 138.40us | 1.006 |
-| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 199.50us | 197.53us | 1.010 |
-| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 46.91us | 46.09us | 1.018 |
-| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 61.15us | 60.29us | 1.014 |
-| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 82.17us | 80.09us | 1.026 |
-| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 126.02us | 123.97us | 1.017 |
-| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 84.10us | 82.16us | 1.024 |
-| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 83.94us | 82.05us | 1.023 |
-| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 147.98us | 144.28us | 1.026 |
-| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 209.74us | 204.18us | 1.027 |
+### `deepgemm_sm100_fp8_mqa_logits` (baseline=`deepgemm`)
 
-### `deepgemm_sm100_fp4_mqa_logits` (baseline=`deepgemm`)
+| config | deepgemm | tirx | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `s2048_skv4096_h64_d128_f32_dense_cp` | 44.00us | 44.49us | 0.989 | 1 |
+| `s2048_skv4096_h64_d128_f32_dense_nocp` | 57.69us | 58.43us | 0.987 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_cp` | 77.33us | 77.38us | 0.999 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_nocp` | 117.73us | 118.44us | 0.994 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_cp` | 75.52us | 75.77us | 0.997 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_nocp` | 75.57us | 76.16us | 0.992 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_cp` | 133.81us | 133.16us | 1.005 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_nocp` | 192.41us | 201.40us | 0.955 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_cp` | 46.44us | 44.28us | 1.049 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 60.38us | 59.15us | 1.021 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_cp` | 80.81us | 78.34us | 1.032 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 124.10us | 119.19us | 1.041 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_cp` | 82.16us | 77.14us | 1.065 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 82.26us | 76.78us | 1.071 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_cp` | 145.29us | 134.59us | 1.080 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 206.64us | 193.31us | 1.069 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_cp` | 44.26us | 44.85us | 0.987 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 58.90us | 59.06us | 0.997 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_cp` | 78.41us | 78.09us | 1.004 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 120.82us | 119.50us | 1.011 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_cp` | 77.42us | 76.89us | 1.007 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 77.03us | 77.42us | 0.995 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_cp` | 136.75us | 135.73us | 1.008 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 196.20us | 193.79us | 1.012 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 46.34us | 45.61us | 1.016 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 60.94us | 59.65us | 1.022 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 80.72us | 79.53us | 1.015 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 124.13us | 121.22us | 1.024 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 82.07us | 80.63us | 1.018 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 82.30us | 80.60us | 1.021 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 145.44us | 141.84us | 1.025 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 206.94us | 202.45us | 1.022 | 1 |
 
+### `deepgemm_sm100_fp4_mqa_logits` (baseline=`deepgemm`)
 
-| config | deepgemm | tirx | baseline/tirx |
-|---|---:|---:|---:|
-| `s2048_skv4096_h64_d128_f32_dense_cp` | 41.25us | 41.52us | 0.994 |
-| `s2048_skv4096_h64_d128_f32_dense_nocp` | 53.67us | 54.10us | 0.992 |
-| `s2048_skv8192_h64_d128_f32_dense_cp` | 71.99us | 72.44us | 0.994 |
-| `s2048_skv8192_h64_d128_f32_dense_nocp` | 111.41us | 111.13us | 1.003 |
-| `s4096_skv4096_h64_d128_f32_dense_cp` | 73.25us | 73.47us | 0.997 |
-| `s4096_skv4096_h64_d128_f32_dense_nocp` | 73.21us | 73.52us | 0.996 |
-| `s4096_skv8192_h64_d128_f32_dense_cp` | 130.21us | 129.54us | 1.005 |
-| `s4096_skv8192_h64_d128_f32_dense_nocp` | 186.20us | 184.96us | 1.007 |
-| `s2048_skv4096_h64_d128_f32_compressed_cp` | 45.14us | 42.37us | 1.066 |
-| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 59.05us | 54.82us | 1.077 |
-| `s2048_skv8192_h64_d128_f32_compressed_cp` | 79.09us | 73.69us | 1.073 |
-| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 122.95us | 113.08us | 1.087 |
-| `s4096_skv4096_h64_d128_f32_compressed_cp` | 80.41us | 73.88us | 1.088 |
-| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 80.32us | 73.81us | 1.088 |
-| `s4096_skv8192_h64_d128_f32_compressed_cp` | 144.14us | 131.25us | 1.098 |
-| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 206.26us | 187.68us | 1.099 |
-| `s2048_skv4096_h64_d128_bf16_dense_cp` | 42.24us | 42.51us | 0.994 |
-| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 55.24us | 55.44us | 0.996 |
-| `s2048_skv8192_h64_d128_bf16_dense_cp` | 74.32us | 74.16us | 1.002 |
-| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 114.28us | 113.84us | 1.004 |
-| `s4096_skv4096_h64_d128_bf16_dense_cp` | 74.91us | 74.90us | 1.000 |
-| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 74.90us | 74.84us | 1.001 |
-| `s4096_skv8192_h64_d128_bf16_dense_cp` | 133.11us | 132.55us | 1.004 |
-| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 190.79us | 189.49us | 1.007 |
-| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 44.99us | 45.73us | 0.984 |
-| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 59.06us | 60.01us | 0.984 |
-| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 79.27us | 80.35us | 0.987 |
-| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 122.57us | 123.86us | 0.990 |
-| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 79.93us | 81.00us | 0.987 |
-| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 79.78us | 80.97us | 0.985 |
-| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 142.89us | 144.28us | 0.990 |
-| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 204.95us | 206.88us | 0.991 |
+| config | deepgemm | tirx | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `s2048_skv4096_h64_d128_f32_dense_cp` | 39.78us | 40.41us | 0.984 | 1 |
+| `s2048_skv4096_h64_d128_f32_dense_nocp` | 51.63us | 52.15us | 0.990 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_cp` | 68.57us | 69.27us | 0.990 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_nocp` | 105.31us | 105.61us | 0.997 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_cp` | 69.83us | 69.95us | 0.998 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_nocp` | 69.74us | 69.72us | 1.000 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_cp` | 123.12us | 122.59us | 1.004 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_nocp` | 175.80us | 174.99us | 1.005 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_cp` | 44.43us | 41.09us | 1.081 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 58.08us | 53.10us | 1.094 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_cp` | 77.15us | 70.58us | 1.093 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 119.08us | 107.61us | 1.107 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_cp` | 77.49us | 70.97us | 1.092 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 77.47us | 70.86us | 1.093 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_cp` | 138.42us | 124.58us | 1.111 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 198.99us | 177.86us | 1.119 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_cp` | 42.04us | 41.30us | 1.018 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 54.48us | 53.58us | 1.017 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_cp` | 72.25us | 71.16us | 1.015 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 110.92us | 108.73us | 1.020 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_cp` | 73.97us | 71.40us | 1.036 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 73.97us | 71.40us | 1.036 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_cp` | 129.30us | 126.49us | 1.022 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 185.61us | 180.39us | 1.029 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 42.89us | 41.88us | 1.024 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 55.43us | 54.18us | 1.023 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 73.95us | 72.32us | 1.023 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 113.15us | 110.48us | 1.024 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 75.31us | 72.50us | 1.039 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 75.28us | 72.66us | 1.036 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 132.59us | 128.38us | 1.033 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 189.25us | 183.46us | 1.032 | 1 |
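
Reviewer's note, not part of the commit: the rerun rule in the updated "Reference baseline" section (rows below `0.95x` are rerun twice, then reported as a median over three measurements) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script that produced the tables. The function names are hypothetical, and the sketch assumes the median is taken per column (baseline and TIRX separately) before the ratio is formed; the commit does not say whether the median is applied to the times or to the ratios.

```python
from statistics import median

# Threshold from the commit text: rows below 0.95x in the full run are rerun twice.
RERUN_THRESHOLD = 0.95


def speedup(baseline_us: float, tirx_us: float) -> float:
    """Return baseline/tirx; a value above 1 means TIRX is faster."""
    return baseline_us / tirx_us


def needs_rerun(baseline_us: float, tirx_us: float) -> bool:
    """True when the full-run ratio falls below the rerun threshold."""
    return speedup(baseline_us, tirx_us) < RERUN_THRESHOLD


def summarize(samples: list[tuple[float, float]]) -> dict:
    """Collapse (baseline_us, tirx_us) samples into one table row.

    Assumption: medians are taken per column, then divided.
    `samples` holds the full-run measurement first, then any reruns.
    """
    base = median(b for b, _ in samples)
    tirx = median(t for _, t in samples)
    return {
        "baseline_us": base,
        "tirx_us": tirx,
        "ratio": round(speedup(base, tirx), 3),
        "runs": len(samples),
    }


if __name__ == "__main__":
    # Row taken from the diff: fp16_4096x4096x4096, reported with runs=1.
    print(summarize([(156.08, 95.04)]))
    # Hypothetical rerun values, for illustration only (not from the commit).
    print(summarize([(5.47, 16.46), (5.50, 16.40), (5.40, 16.50)]))
```

Only the first sample in the usage section comes from the diff; the rerun values in the second call are invented to exercise the three-sample path.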
