bohan Thu, 28 May 2026 21:45:02 -0700

This is an automated email from the ASF dual-hosted git repository.

spectrometerHBH pushed a commit to branch tir-bench
in repository https://gitbox.apache.org/repos/asf/tvm.git


commit cd8790bc5432240aea8bbc72058521d3f80d8801
Author: spectrometerHBH <[email protected]>
AuthorDate: Fri May 29 00:44:36 2026 -0400

    feat(infra): util-gated GPU selection for tir-bench + per-config warmup/repeat
    
    - run.py: pick/skip GPUs by utilization.gpu so idle-but-resident cards
      (held VRAM at ~0% util) are shareable; detect interference via per-PID
      sm% (nvidia-smi pmon) instead of mere compute-app PID presence
    - run.py: remove --gpus; GPU selection is fully automatic
    - tir-bench.md: document that selection is automatic (never pin a GPU)
    - workloads.yaml: pin bandwidth-bound 16384^3 GEMMs (fp16/bf16/nvfp4)
      to warmup10/repeat30 (short exposure beats more iters under contention)
    - gitignore: ignore regenerable .tir-bench/ run artifacts
---
 .claude/commands/tir-bench.md             |  20 ++-
 .claude/commands/tir-bench/run.py         | 224 ++++++++++++++++++++++--------
 .claude/commands/tir-bench/workloads.yaml |   6 +-
 .gitignore                                |   4 +
 4 files changed, 189 insertions(+), 65 deletions(-)
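
Note on the gating change: the commit message argues that gating on utilization, rather than on the mere presence of a compute-app PID, lets the sweep share idle-but-resident cards. A minimal sketch of that difference follows. The sample rows, the `parse_utils` and `shareable` helper names, and the assumption that rows arrive as `index, utilization` (as with `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`) are illustrative and not taken from run.py.

```python
def parse_utils(csv_rows):
    """Parse rows shaped like 'index, utilization' into {index: percent}."""
    utils = {}
    for row in csv_rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) < 2:
            continue
        try:
            utils[parts[0]] = float(parts[1])
        except ValueError:
            continue  # non-numeric utilization field: skip the row
    return utils


def shareable(utils, owned, threshold=10.0):
    """Indices below the utilization threshold and not already handed out."""
    return sorted(
        (idx for idx, u in utils.items() if u < threshold and idx not in owned),
        key=int,
    )


# Hypothetical snapshot: GPU 1 is pegged, GPU 2 parks VRAM at 0% util,
# and GPU 0 is already owned by one of our own workers.
rows = ["0, 0", "1, 97", "2, 0", "3, 9"]
resident_pids = {"1", "2"}  # cards with some compute-app PID present

utils = parse_utils(rows)
util_gated = shareable(utils, owned={"0"})
pid_gated = sorted(i for i in utils if i not in resident_pids and i != "0")

print(util_gated)  # ['2', '3']
print(pid_gated)   # ['3']
```

Under these made-up numbers the PID-presence gate rejects GPU 2 even though nobody is computing on it, while the utilization gate hands it out.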

diff --git a/.claude/commands/tir-bench.md b/.claude/commands/tir-bench.md
index 90a866f927..4c020d5315 100644
--- a/.claude/commands/tir-bench.md
+++ b/.claude/commands/tir-bench.md
@@ -1,6 +1,6 @@
 ---
-description: "Pre-commit kernel regression benchmark (free-GPU polling + parallel sweep)"
-argument-hint: "[--filter SUBSTR] [--gpus 0,1] [--baseline PATH] [--threshold PCT] [--label STR]"
+description: "Pre-commit kernel regression benchmark (auto GPU selection + parallel sweep)"
+argument-hint: "[--filter SUBSTR] [--baseline PATH] [--threshold PCT] [--label STR] [--util-threshold PCT]"
 allowed-tools: ["Bash", "Read"]
 ---
 
@@ -9,9 +9,19 @@ allowed-tools: ["Bash", "Read"]
 Run the curated workload list in `.claude/commands/tir-bench/workloads.yaml`
 on every free GPU in parallel, dump JSON, and diff against the previous run.
 
-Methodology mirrors `tirx-bench-ci/bench-run.sh`: poll `nvidia-smi
---query-compute-apps` to find unoccupied GPUs, hand out one workload per
-GPU under an in-process lock, fall back to a 5 s retry when all are busy.
+> **NEVER ACTIVELY SELECT A GPU FOR THIS run.py — IT SELECTS GPUs AUTOMATICALLY.**
+> There is no `--gpus` flag. Do not set `CUDA_VISIBLE_DEVICES` to pin cards either.
+> run.py probes every visible GPU, then on each acquire scans utilization and
+> picks any card below `--util-threshold` (skipping cards in active use,
+> requeueing if a neighbor bursts mid-run). Manually pinning defeats this and
+> can land work on a busy card. If the machine is contended, let it run — busy
+> cards are skipped and re-tried automatically; just re-run later for full coverage.
+
+Methodology mirrors `tirx-bench-ci/bench-run.sh`: poll per-GPU
+`utilization.gpu` to find idle cards (a card merely holding resident VRAM at
+low util still counts as free), hand out one workload per GPU under an
+in-process lock, and fall back to a 5 s retry when all are busy. Interference
+is detected via per-PID `sm%` (`nvidia-smi pmon`) and the workload requeued.
 
 **Args forwarded to run.py:** `$ARGUMENTS`
 
diff --git a/.claude/commands/tir-bench/run.py b/.claude/commands/tir-bench/run.py
index c6dd95e5ac..c2d06ba30d 100644
--- a/.claude/commands/tir-bench/run.py
+++ b/.claude/commands/tir-bench/run.py
@@ -1,14 +1,17 @@
 #!/usr/bin/env python3
 """tir-bench: pre-commit regression benchmark for TIRx kernels.
 
-Methodology mirrors tirx-bench-ci's bench-run.sh: nvidia-smi compute-apps
-polling + a lock to assign at most one workload per free GPU. Differences
+Methodology mirrors tirx-bench-ci's bench-run.sh: per-GPU utilization
+polling + a lock to assign at most one workload per free GPU. GPU selection
+is fully automatic — there is no --gpus flag on purpose (a human pinning
+cards defeats the util gate and can land work on a busy card). Differences
 from bench-ci: no build phase, no SQLite, no worktrees — we test the
 working tree as-is and emit JSON + a markdown regression report.
 
 Usage:
-    python run.py [--workloads PATH] [--filter SUBSTR] [--gpus 0,1,2]
+    python run.py [--workloads PATH] [--filter SUBSTR]
                   [--baseline PATH] [--threshold PCT] [--label STR]
+                  [--util-threshold PCT]
 
 Exit codes:
     0  no regressions (or no baseline yet)
@@ -39,6 +42,13 @@ DEFAULT_BASELINE = SCRIPT_DIR / "baseline.json"  # pinned reference; user `cp <n
 POLL_INTERVAL = 5.0       # seconds between GPU re-checks when none is free
 MONITOR_INTERVAL = 0.5    # seconds between nvidia-smi polls during a workload
 MAX_INTERFERED_RETRIES = 5  # workloads that hit INTERFERED get requeued up to this many times
+DEFAULT_UTIL_THRESHOLD = 10.0  # % GPU util at/above which a card counts as "actively computing"
+# Why util, not PID-presence: on shared boxes other tenants routinely *park*
+# processes that hold tens-to-hundreds of GiB of VRAM at 0% utilization. They
+# aren't competing for SMs, so co-running our bench on such a card is fine.
+# Gating on "any compute-app PID present" would reject every such card and
+# starve the sweep; gating on utilization lets us share idle-but-resident cards
+# while still avoiding cards where a neighbor is actually burning the GPU.
 
 # Tiny real workload used to decide whether a GPU is actually usable.
 # Catches: driver hangs, ECC errors when touching memory, cuBLAS init
@@ -82,16 +92,24 @@ def load_workloads(path: Path) -> list[dict]:
 class GpuPool:
     """Hand out free GPU indices to worker threads.
 
-    Every acquire() re-queries nvidia-smi for memory/util to decide who is
-    free right now, so a GPU that was busy at sweep start and freed up later
-    is still reusable. The broken-card probe is a separate startup step;
-    by the time the pool is built, `allowed` already excludes broken cards.
+    Every acquire() re-queries nvidia-smi utilization to decide who is free
+    right now: a card counts as taken only if its GPU utilization is at/above
+    `util_threshold` (someone is actively computing) — a card merely *holding*
+    VRAM at 0% util is fair game to co-run on. So a GPU that was pegged at
+    sweep start and went idle later is reusable the moment its util drops. The
+    broken-card probe is a separate startup step; by the time the pool is
+    built, `allowed` already excludes broken cards.
     """
 
-    def __init__(self, allowed: set[str] | None = None):
+    def __init__(
+        self,
+        allowed: set[str] | None = None,
+        util_threshold: float = DEFAULT_UTIL_THRESHOLD,
+    ):
         self._owned: set[str] = set()
         self._lock = threading.Lock()
         self._allowed = allowed
+        self.util_threshold = util_threshold
 
     @staticmethod
     def _nvidia_smi(args: list[str]) -> list[str]:
@@ -111,13 +129,32 @@ class GpuPool:
         return result
 
     def _busy_indices(self) -> set[str]:
-        """GPU indices with at least one compute-app PID (anyone's). When
-        the container exposes cross-namespace process visibility, this is
-        a clean, threshold-free signal — ours and theirs both show up."""
+        """GPU indices with at least one compute-app PID (anyone's). Kept for
+        the informational startup banner only — selection uses _occupied_indices
+        (utilization), since a PID may just be parking idle VRAM."""
         rows = self._nvidia_smi(["--query-compute-apps=gpu_uuid"])
         busy_uuids = {l for l in rows if l}
         return {idx for idx, uuid in self._all_gpus() if uuid in busy_uuids}
 
+    def _utils(self) -> dict[str, float]:
+        """Map GPU index -> current utilization.gpu (percent)."""
+        rows = self._nvidia_smi(["--query-gpu=index,utilization.gpu"])
+        out: dict[str, float] = {}
+        for line in rows:
+            parts = [p.strip() for p in line.split(",")]
+            if len(parts) >= 2:
+                try:
+                    out[parts[0]] = float(parts[1])
+                except ValueError:
+                    pass
+        return out
+
+    def _occupied_indices(self) -> set[str]:
+        """GPU indices actively computing (util >= threshold) — i.e. a real
+        tenant is burning the GPU, so we should not co-run there. Idle cards
+        holding only resident VRAM read ~0% util and are NOT occupied."""
+        return {idx for idx, u in self._utils().items() if u >= self.util_threshold}
+
     def total_visible(self) -> int:
         gpus = self._all_gpus()
         if self._allowed is not None:
@@ -127,17 +164,18 @@ class GpuPool:
     def acquire(self) -> str:
         """Block until a free GPU is found; return its index string.
 
-        Re-queries nvidia-smi memory/util on every loop iteration so that a
-        GPU which was busy when the previous workload acquired now counts as
-        free if the other tenant has released it.
+        Re-queries nvidia-smi utilization on every loop iteration so that a
+        GPU which was pegged when the previous workload acquired now counts as
+        free once the other tenant's util drops below the threshold. A card
+        that only holds resident VRAM (0% util) counts as free.
         """
         while True:
             with self._lock:
-                busy = self._busy_indices()
+                occupied = self._occupied_indices()
                 for idx, _uuid in self._all_gpus():
                     if self._allowed is not None and idx not in self._allowed:
                         continue
-                    if idx in self._owned or idx in busy:
+                    if idx in self._owned or idx in occupied:
                         continue
                     self._owned.add(idx)
                     return idx
@@ -272,6 +310,55 @@ def _pids_on_gpu(uuid: str) -> set[int]:
     return pids
 
 
+def _pid_sm_on_gpu(gpu_index: str) -> dict[int, float]:
+    """Map PID -> sm-utilization (%) for every compute process on the given
+    physical GPU, via `nvidia-smi pmon`.
+
+    This is the signal that separates a neighbor *actively burning the GPU*
+    from one merely *parking resident VRAM* at 0% sm — and, crucially, it is
+    per-process, so it stays meaningful while our own kernel pegs the
+    device-level utilization. A single `pmon -c 1` snapshot is ~0.15s here.
+
+    pmon `-s u` columns: gpu  pid  type  sm  mem  enc  dec  jpg  ofa  command.
+    Inactive rows show "-" for pid/sm; those are skipped.
+    """
+    try:
+        out = subprocess.run(
+            ["nvidia-smi", "pmon", "-i", str(gpu_index), "-c", "1", "-s", "u"],
+            capture_output=True, text=True, timeout=8,
+        ).stdout
+    except Exception:
+        return {}
+    result: dict[int, float] = {}
+    for line in out.splitlines():
+        line = line.strip()
+        if not line or line.startswith("#"):
+            continue
+        fields = line.split()
+        if len(fields) < 4:
+            continue
+        try:
+            pid = int(fields[1])
+            sm = float(fields[3])
+        except ValueError:
+            continue  # pid or sm is "-" (no active process this sample)
+        result[pid] = sm
+    return result
+
+
+def _active_strangers(gpu_index: str, our_pids: set[int], sm_threshold: float) -> dict[int, float]:
+    """PIDs on `gpu_index` that are NOT ours and whose sm-util >= threshold.
+
+    Empty result == no neighbor is actively computing right now, so an
+    idle-but-resident squatter (sm 0) does not count as interference and we
+    are free to share the card."""
+    return {
+        pid: sm
+        for pid, sm in _pid_sm_on_gpu(gpu_index).items()
+        if pid not in our_pids and sm >= sm_threshold
+    }
+
+
 def _our_process_tree(root_pid: int) -> set[int]:
     """Set of PIDs in the process tree rooted at root_pid (inclusive).
 
@@ -311,31 +398,38 @@ def _run_subprocess_monitored(
     env: dict[str, str],
     cwd: str,
     log_path: Path,
-    gpu_uuid: str,
+    gpu_index: str,
     monitor_interval: float,
+    sm_threshold: float,
 ) -> tuple[int, bool, list[int]]:
-    """Spawn `cmd` on the assigned GPU and watch for intruders.
+    """Spawn `cmd` on the assigned GPU and watch for *active* intruders.
 
     Returns (returncode, interfered, intruder_pids).
 
-    Two-stage protection:
+    Interference == another tenant is actually computing on our card, i.e. a
+    PID that is not in our process tree has sm-utilization >= `sm_threshold`.
+    A neighbor that only parks resident VRAM at 0% sm is NOT interference — we
+    deliberately co-run with those (that is the whole point of the util gate).
 
-    1. **Pre-spawn check**: if the GPU has any PID right before we Popen,
-       someone won the race between pool.acquire() and now. Don't even
-       launch — return INTERFERED with the squatters' PIDs so the
-       dispatcher requeues this workload.
+    Two-stage protection, both using per-PID sm-util (`nvidia-smi pmon`):
 
-    2. **Per-poll ancestry check**: at every `monitor_interval`, take the
-       set of PIDs on the GPU and subtract our process tree (walked via
-       /proc PPID chain from our subprocess.pid). Anything left = intruder
-       → SIGTERM subprocess. No "grace period" guessing — ancestry is the
-       ground truth for what's ours.
+    1. **Pre-spawn check**: if any stranger is already actively computing,
+       someone grabbed the card between pool.acquire() and now (or an
+       idle-looking card just woke up). Don't launch — return INTERFERED so
+       the dispatcher requeues this workload.
+
+    2. **Per-poll check**: at every `monitor_interval`, take the per-PID sm
+       map, drop our process tree (walked via /proc PPID chain), and if any
+       remaining PID is at/above the sm threshold, SIGTERM the subprocess.
+       This catches a brand-new intruder *and* a resident neighbor that
+       bursts its own sm mid-run — per-PID sm stays meaningful even while our
+       own kernel pegs the device-level utilization.
     """
-    if gpu_uuid:
-        pre = _pids_on_gpu(gpu_uuid)
+    if gpu_index:
+        pre = _active_strangers(gpu_index, set(), sm_threshold)
         if pre:
             with open(log_path, "w") as lf:
-                lf.write(f"RACE_LOST: pre-spawn check — GPU already has PIDs {sorted(pre)}\n")
+                lf.write(f"RACE_LOST: pre-spawn check — active strangers {pre}\n")
             return -1, True, sorted(pre)
 
     with open(log_path, "w") as lf:
@@ -348,15 +442,12 @@ def _run_subprocess_monitored(
                 break  # subprocess exited normally
             except subprocess.TimeoutExpired:
                 pass
-            if not gpu_uuid:
+            if not gpu_index:
                 continue
-            on_gpu = _pids_on_gpu(gpu_uuid)
-            if not on_gpu:
-                continue  # subprocess hasn't initialized CUDA yet, nothing to compare
             ours = _our_process_tree(proc.pid)
-            strangers = on_gpu - ours
-            if strangers:
-                intruders = sorted(strangers)
+            active = _active_strangers(gpu_index, ours, sm_threshold)
+            if active:
+                intruders = sorted(active)
                 try:
                     proc.terminate()
                     proc.wait(timeout=10)
@@ -422,9 +513,12 @@ def run_one(
     interfered = False
     intruder_pids: list[int] = []
     try:
-        gpu_uuid = "" if no_monitor else (_gpu_uuid_of(gpu) or "")
+        # Pass the physical GPU index (not "" ) only when monitoring is on;
+        # the monitor uses per-PID sm-util (pmon) keyed by this index.
+        monitor_idx = "" if no_monitor else gpu
         returncode, interfered, intruder_pids = _run_subprocess_monitored(
-            cmd, env, workdir, log_path, gpu_uuid, MONITOR_INTERVAL,
+            cmd, env, workdir, log_path, monitor_idx, MONITOR_INTERVAL,
+            pool.util_threshold,
         )
         if interfered:
             record["status"] = "INTERFERED"
@@ -896,8 +990,9 @@ def main() -> None:
                     help="Regression threshold in percent slowdown")
     ap.add_argument("--filter", type=str, default=None,
                     help="Only keep workloads whose kernel contains this substring")
-    ap.add_argument("--gpus", type=str, default=None,
-                    help="Comma-separated GPU indices (default: all visible)")
+    # NOTE: there is intentionally no --gpus flag. GPU selection is automatic
+    # (util-gated probe + per-acquire utilization scan); a human pinning cards
+    # defeats that and can land work on a busy card. See acquire()/_occupied_indices.
     ap.add_argument("--label", type=str, default=None,
                     help="Free-form label for this run (default: git short sha)")
     ap.add_argument("--no-report", action="store_true",
@@ -908,6 +1003,12 @@ def main() -> None:
                     help="Per-GPU probe timeout in seconds (default 60)")
     ap.add_argument("--no-monitor", action="store_true",
                     help="Don't monitor for GPU interference during workloads")
+    ap.add_argument("--util-threshold", type=float, default=DEFAULT_UTIL_THRESHOLD,
+                    help="%% GPU/sm utilization at/above which a card counts as "
+                         "actively in use: selection skips such cards and the "
+                         "monitor requeues if a neighbor crosses it mid-run. "
+                         "Cards merely holding resident VRAM at lower util are "
+                         f"shared (default {DEFAULT_UTIL_THRESHOLD:g})")
     args = ap.parse_args()
 
     workloads = load_workloads(args.workloads)
@@ -942,26 +1043,35 @@ def main() -> None:
     print(f"[tir-bench]   tail : tail -f {latest_log}")
     print(f"[tir-bench] run id : {stamp}")
 
-    allowed = set(args.gpus.split(",")) if args.gpus else None
-
-    # ── Two-step GPU selection ──
-    # 1. Startup probe: run a tiny fp16 matmul on every --gpus-filtered card
+    # ── Automatic GPU selection (no manual override on purpose) ──
+    # 1. Startup probe: run a tiny fp16 matmul on every visible card
     #    (including busy ones — the probe is light, finishes fine on a
     #    contended card; this catches broken drivers / ECC). Probe failures
     #    are banned for the rest of the run.
-    # 2. Per-workload acquire: re-scan nvidia-smi memory/util every time we
-    #    need a card, pick any probe-OK one that's currently free. A card
-    #    that was busy at sweep start is reusable the moment it frees up.
-    listing_pool = GpuPool(allowed=allowed)
-    in_filter = [idx for idx, _ in listing_pool._all_gpus() if allowed is None or idx in allowed]
+    # 2. Per-workload acquire: re-scan utilization every time we need a card
+    #    and pick any probe-OK one whose util is below --util-threshold. A
+    #    card pegged at sweep start is reusable the moment its util drops; a
+    #    card merely holding resident VRAM at low util is shared right away.
+    listing_pool = GpuPool(util_threshold=args.util_threshold)
+    in_filter = [idx for idx, _ in listing_pool._all_gpus()]
     if not in_filter:
-        print("[tir-bench] no visible GPUs match --gpus filter.", file=sys.stderr)
+        print("[tir-bench] no visible GPUs.", file=sys.stderr)
         sys.exit(1)
-    busy_now = sorted(listing_pool._busy_indices() & set(in_filter))
+    utils_now = listing_pool._utils()
+    occupied_now = sorted(listing_pool._occupied_indices() & set(in_filter), key=int)
+    resident = sorted(listing_pool._busy_indices() & set(in_filter), key=int)
+    util_str = " ".join(f"{i}:{utils_now.get(i, 0):.0f}%" for i in sorted(in_filter, key=int))
+    print(
+        f"[tir-bench] visible: {len(in_filter)} {sorted(in_filter, key=int)}; "
+        f"util now [{util_str}]",
+        flush=True,
+    )
     print(
-        f"[tir-bench] visible: {len(in_filter)} {sorted(in_filter)}; "
-        f"busy now: {busy_now if busy_now else 'none'} "
-        f"(any compute-app PID present)",
+        f"[tir-bench] gate: util-threshold={args.util_threshold:g}% — "
+        f"occupied (skip): {occupied_now if occupied_now else 'none'}; "
+        f"shareable incl. idle-but-resident: "
+        f"{sorted((set(in_filter) - set(occupied_now)), key=int)} "
+        f"(resident-VRAM cards: {resident if resident else 'none'})",
         flush=True,
     )
 
@@ -978,7 +1088,7 @@ def main() -> None:
             print(f"[tir-bench]   gpu {idx}: {err}", file=sys.stderr)
         sys.exit(1)
 
-    pool = GpuPool(allowed=usable)
+    pool = GpuPool(allowed=usable, util_threshold=args.util_threshold)
     n_gpus = len(usable)
 
     _repo_git = collect_repo_git()
diff --git a/.claude/commands/tir-bench/workloads.yaml b/.claude/commands/tir-bench/workloads.yaml
index 0769f50e1a..c8d99584ea 100644
--- a/.claude/commands/tir-bench/workloads.yaml
+++ b/.claude/commands/tir-bench/workloads.yaml
@@ -111,12 +111,12 @@ workloads:
   - {kernel: fp16_bf16_gemm, config: fp16_2048x2048x2048}
   - {kernel: fp16_bf16_gemm, config: fp16_4096x4096x4096}
   - {kernel: fp16_bf16_gemm, config: fp16_8192x8192x8192}
-  - {kernel: fp16_bf16_gemm, config: fp16_16384x16384x16384}
+  - {kernel: fp16_bf16_gemm, config: fp16_16384x16384x16384, warmup: 10, repeat: 30}  # big bandwidth-bound GEMM: short exposure beats more iters under contention
   - {kernel: fp16_bf16_gemm, config: bf16_1024x1024x1024}
   - {kernel: fp16_bf16_gemm, config: bf16_2048x2048x2048}
   - {kernel: fp16_bf16_gemm, config: bf16_4096x4096x4096}
   - {kernel: fp16_bf16_gemm, config: bf16_8192x8192x8192}
-  - {kernel: fp16_bf16_gemm, config: bf16_16384x16384x16384}
+  - {kernel: fp16_bf16_gemm, config: bf16_16384x16384x16384, warmup: 10, repeat: 30}  # big bandwidth-bound GEMM: short exposure beats more iters under contention
   # ── fp8_blockwise_gemm
   - {kernel: fp8_blockwise_gemm, config: smoke_1024x1024x1024}
   - {kernel: fp8_blockwise_gemm, config: deepgemm_m4096_n2112_k7168}
@@ -132,4 +132,4 @@ workloads:
   - {kernel: nvfp4_gemm, config: 2048x2048x2048}
   - {kernel: nvfp4_gemm, config: 4096x4096x4096}
   - {kernel: nvfp4_gemm, config: 8192x8192x8192}
-  - {kernel: nvfp4_gemm, config: 16384x16384x16384}
+  - {kernel: nvfp4_gemm, config: 16384x16384x16384, warmup: 10, repeat: 30}  # big bandwidth-bound GEMM: short exposure; take median-of-3 (per-run noisy, median stable)
diff --git a/.gitignore b/.gitignore
index b2b383bdd7..ca3555a73d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -291,3 +291,7 @@ python/bin/
 python/typing_extensions.py
 python/*.dist-info/
 pytest-of-bohanhou/
+
+# tir-bench run artifacts (regenerable; see .claude/commands/tir-bench.md)
+.tir-bench/
+.tir-bench-*/
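
Note on interference detection: the monitor in this patch keys off per-PID sm utilization, so a parked neighbor at 0% sm is tolerated while an actively computing one triggers a requeue. The sketch below replays that filter on a hand-written sample rather than on live `nvidia-smi pmon` output. The sample text, PIDs, and command names are invented for illustration; only the column order (gpu, pid, type, sm, ...) follows the docstring in the diff.

```python
# Hand-written sample in the column order described in the patch's docstring.
SAMPLE_PMON = """\
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command
# Idx           #    C/G      %      %      %      %      %      %    name
    0       4242     C     96     40      -      -      -      -    python
    0       5151     C      0      0      -      -      -      -    parked_job
    0       6161     C     35     12      -      -      -      -    neighbor
    1          -     -      -      -      -      -      -      -    -
"""


def parse_pmon(text):
    """Map PID -> sm% from pmon-style text, skipping headers and '-' rows."""
    sm_by_pid = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            sm_by_pid[int(fields[1])] = float(fields[3])
        except ValueError:
            continue  # pid or sm is "-" (no active process in this sample)
    return sm_by_pid


def active_strangers(sm_by_pid, our_pids, sm_threshold):
    """PIDs that are not ours and whose sm% is at or above the threshold."""
    return {
        pid: sm
        for pid, sm in sm_by_pid.items()
        if pid not in our_pids and sm >= sm_threshold
    }


sm = parse_pmon(SAMPLE_PMON)
intruders = active_strangers(sm, our_pids={4242}, sm_threshold=10.0)
print(intruders)  # {6161: 35.0}
```

In this sample, PID 4242 stands in for the benchmark's own process and is excluded, PID 5151 is resident but idle and is ignored, and only PID 6161 counts as interference.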

(tvm) 22/22: feat(infra): util-gated GPU selection for tir-bench + per-config warmup/repeat
