This is an automated email from the ASF dual-hosted git repository.

spectrometerHBH pushed a commit to branch tir-bench
in repository https://gitbox.apache.org/repos/asf/tvm.git

commit cd8790bc5432240aea8bbc72058521d3f80d8801
Author: spectrometerHBH <[email protected]>
AuthorDate: Fri May 29 00:44:36 2026 -0400

    feat(infra): util-gated GPU selection for tir-bench + per-config warmup/repeat

    - run.py: pick/skip GPUs by utilization.gpu so idle-but-resident cards
      (held VRAM at ~0% util) are shareable; detect interference via per-PID
      sm% (nvidia-smi pmon) instead of mere compute-app PID presence
    - run.py: remove --gpus; GPU selection is fully automatic
    - tir-bench.md: document that selection is automatic (never pin a GPU)
    - workloads.yaml: pin bandwidth-bound 16384^3 GEMMs (fp16/bf16/nvfp4) to
      warmup10/repeat30 (short exposure beats more iters under contention)
    - gitignore: ignore regenerable .tir-bench/ run artifacts
---
 .claude/commands/tir-bench.md             |  20 ++-
 .claude/commands/tir-bench/run.py         | 224 ++++++++++++++++++++++--------
 .claude/commands/tir-bench/workloads.yaml |   6 +-
 .gitignore                                |   4 +
 4 files changed, 189 insertions(+), 65 deletions(-)

diff --git a/.claude/commands/tir-bench.md b/.claude/commands/tir-bench.md
index 90a866f927..4c020d5315 100644
--- a/.claude/commands/tir-bench.md
+++ b/.claude/commands/tir-bench.md
@@ -1,6 +1,6 @@
 ---
-description: "Pre-commit kernel regression benchmark (free-GPU polling + parallel sweep)"
-argument-hint: "[--filter SUBSTR] [--gpus 0,1] [--baseline PATH] [--threshold PCT] [--label STR]"
+description: "Pre-commit kernel regression benchmark (auto GPU selection + parallel sweep)"
+argument-hint: "[--filter SUBSTR] [--baseline PATH] [--threshold PCT] [--label STR] [--util-threshold PCT]"
 allowed-tools: ["Bash", "Read"]
 ---

@@ -9,9 +9,19 @@ allowed-tools: ["Bash", "Read"]
 Run the curated workload list in `.claude/commands/tir-bench/workloads.yaml`
 on every free GPU in parallel, dump JSON, and diff against the previous run.

-Methodology mirrors `tirx-bench-ci/bench-run.sh`: poll `nvidia-smi
---query-compute-apps` to find unoccupied GPUs, hand out one workload per
-GPU under an in-process lock, fall back to a 5 s retry when all are busy.
+> **NEVER ACTIVELY SELECT A GPU FOR THIS run.py — IT SELECTS GPUs AUTOMATICALLY.**
+> There is no `--gpus` flag. Do not set `CUDA_VISIBLE_DEVICES` to pin cards either.
+> run.py probes every visible GPU, then on each acquire scans utilization and
+> picks any card below `--util-threshold` (skipping cards in active use,
+> requeueing if a neighbor bursts mid-run). Manually pinning defeats this and
+> can land work on a busy card. If the machine is contended, let it run — busy
+> cards are skipped and re-tried automatically; just re-run later for full coverage.
+
+Methodology mirrors `tirx-bench-ci/bench-run.sh`: poll per-GPU
+`utilization.gpu` to find idle cards (a card merely holding resident VRAM at
+low util still counts as free), hand out one workload per GPU under an
+in-process lock, and fall back to a 5 s retry when all are busy. Interference
+is detected via per-PID `sm%` (`nvidia-smi pmon`) and the workload requeued.

 **Args forwarded to run.py:** `$ARGUMENTS`

diff --git a/.claude/commands/tir-bench/run.py b/.claude/commands/tir-bench/run.py
index c6dd95e5ac..c2d06ba30d 100644
--- a/.claude/commands/tir-bench/run.py
+++ b/.claude/commands/tir-bench/run.py
@@ -1,14 +1,17 @@
 #!/usr/bin/env python3
 """tir-bench: pre-commit regression benchmark for TIRx kernels.

-Methodology mirrors tirx-bench-ci's bench-run.sh: nvidia-smi compute-apps
-polling + a lock to assign at most one workload per free GPU. Differences
+Methodology mirrors tirx-bench-ci's bench-run.sh: per-GPU utilization
+polling + a lock to assign at most one workload per free GPU. GPU selection
+is fully automatic — there is no --gpus flag on purpose (a human pinning
+cards defeats the util gate and can land work on a busy card). Differences
 from bench-ci: no build phase, no SQLite, no worktrees — we test the
 working tree as-is and emit JSON + a markdown regression report.

 Usage:
-    python run.py [--workloads PATH] [--filter SUBSTR] [--gpus 0,1,2]
+    python run.py [--workloads PATH] [--filter SUBSTR]
                   [--baseline PATH] [--threshold PCT] [--label STR]
+                  [--util-threshold PCT]

 Exit codes:
     0 no regressions (or no baseline yet)
@@ -39,6 +42,13 @@ DEFAULT_BASELINE = SCRIPT_DIR / "baseline.json" # pinned reference; user `cp <n
 POLL_INTERVAL = 5.0 # seconds between GPU re-checks when none is free
 MONITOR_INTERVAL = 0.5 # seconds between nvidia-smi polls during a workload
 MAX_INTERFERED_RETRIES = 5 # workloads that hit INTERFERED get requeued up to this many times
+DEFAULT_UTIL_THRESHOLD = 10.0 # % GPU util at/above which a card counts as "actively computing"
+# Why util, not PID-presence: on shared boxes other tenants routinely *park*
+# processes that hold tens-to-hundreds of GiB of VRAM at 0% utilization. They
+# aren't competing for SMs, so co-running our bench on such a card is fine.
+# Gating on "any compute-app PID present" would reject every such card and
+# starve the sweep; gating on utilization lets us share idle-but-resident cards
+# while still avoiding cards where a neighbor is actually burning the GPU.

 # Tiny real workload used to decide whether a GPU is actually usable.
 # Catches: driver hangs, ECC errors when touching memory, cuBLAS init
@@ -82,16 +92,24 @@ def load_workloads(path: Path) -> list[dict]:
 class GpuPool:
     """Hand out free GPU indices to worker threads.

-    Every acquire() re-queries nvidia-smi for memory/util to decide who is
-    free right now, so a GPU that was busy at sweep start and freed up later
-    is still reusable. The broken-card probe is a separate startup step;
-    by the time the pool is built, `allowed` already excludes broken cards.
+    Every acquire() re-queries nvidia-smi utilization to decide who is free
+    right now: a card counts as taken only if its GPU utilization is at/above
+    `util_threshold` (someone is actively computing) — a card merely *holding*
+    VRAM at 0% util is fair game to co-run on. So a GPU that was pegged at
+    sweep start and went idle later is reusable the moment its util drops. The
+    broken-card probe is a separate startup step; by the time the pool is
+    built, `allowed` already excludes broken cards.
     """

-    def __init__(self, allowed: set[str] | None = None):
+    def __init__(
+        self,
+        allowed: set[str] | None = None,
+        util_threshold: float = DEFAULT_UTIL_THRESHOLD,
+    ):
         self._owned: set[str] = set()
         self._lock = threading.Lock()
         self._allowed = allowed
+        self.util_threshold = util_threshold

     @staticmethod
     def _nvidia_smi(args: list[str]) -> list[str]:
@@ -111,13 +129,32 @@ class GpuPool:
         return result

     def _busy_indices(self) -> set[str]:
-        """GPU indices with at least one compute-app PID (anyone's). When
-        the container exposes cross-namespace process visibility, this is
-        a clean, threshold-free signal — ours and theirs both show up."""
+        """GPU indices with at least one compute-app PID (anyone's). Kept for
+        the informational startup banner only — selection uses _occupied_indices
+        (utilization), since a PID may just be parking idle VRAM."""
         rows = self._nvidia_smi(["--query-compute-apps=gpu_uuid"])
         busy_uuids = {l for l in rows if l}
         return {idx for idx, uuid in self._all_gpus() if uuid in busy_uuids}

+    def _utils(self) -> dict[str, float]:
+        """Map GPU index -> current utilization.gpu (percent)."""
+        rows = self._nvidia_smi(["--query-gpu=index,utilization.gpu"])
+        out: dict[str, float] = {}
+        for line in rows:
+            parts = [p.strip() for p in line.split(",")]
+            if len(parts) >= 2:
+                try:
+                    out[parts[0]] = float(parts[1])
+                except ValueError:
+                    pass
+        return out
+
+    def _occupied_indices(self) -> set[str]:
+        """GPU indices actively computing (util >= threshold) — i.e. a real
+        tenant is burning the GPU, so we should not co-run there. Idle cards
+        holding only resident VRAM read ~0% util and are NOT occupied."""
+        return {idx for idx, u in self._utils().items() if u >= self.util_threshold}
+
     def total_visible(self) -> int:
         gpus = self._all_gpus()
         if self._allowed is not None:
@@ -127,17 +164,18 @@ class GpuPool:
     def acquire(self) -> str:
         """Block until a free GPU is found; return its index string.

-        Re-queries nvidia-smi memory/util on every loop iteration so that a
-        GPU which was busy when the previous workload acquired now counts as
-        free if the other tenant has released it.
+        Re-queries nvidia-smi utilization on every loop iteration so that a
+        GPU which was pegged when the previous workload acquired now counts as
+        free once the other tenant's util drops below the threshold. A card
+        that only holds resident VRAM (0% util) counts as free.
         """
         while True:
             with self._lock:
-                busy = self._busy_indices()
+                occupied = self._occupied_indices()
                 for idx, _uuid in self._all_gpus():
                     if self._allowed is not None and idx not in self._allowed:
                         continue
-                    if idx in self._owned or idx in busy:
+                    if idx in self._owned or idx in occupied:
                         continue
                     self._owned.add(idx)
                     return idx
@@ -272,6 +310,55 @@ def _pids_on_gpu(uuid: str) -> set[int]:
     return pids


+def _pid_sm_on_gpu(gpu_index: str) -> dict[int, float]:
+    """Map PID -> sm-utilization (%) for every compute process on the given
+    physical GPU, via `nvidia-smi pmon`.
+
+    This is the signal that separates a neighbor *actively burning the GPU*
+    from one merely *parking resident VRAM* at 0% sm — and, crucially, it is
+    per-process, so it stays meaningful while our own kernel pegs the
+    device-level utilization. A single `pmon -c 1` snapshot is ~0.15s here.
+
+    pmon `-s u` columns: gpu pid type sm mem enc dec jpg ofa command.
+    Inactive rows show "-" for pid/sm; those are skipped.
+    """
+    try:
+        out = subprocess.run(
+            ["nvidia-smi", "pmon", "-i", str(gpu_index), "-c", "1", "-s", "u"],
+            capture_output=True, text=True, timeout=8,
+        ).stdout
+    except Exception:
+        return {}
+    result: dict[int, float] = {}
+    for line in out.splitlines():
+        line = line.strip()
+        if not line or line.startswith("#"):
+            continue
+        fields = line.split()
+        if len(fields) < 4:
+            continue
+        try:
+            pid = int(fields[1])
+            sm = float(fields[3])
+        except ValueError:
+            continue # pid or sm is "-" (no active process this sample)
+        result[pid] = sm
+    return result
+
+
+def _active_strangers(gpu_index: str, our_pids: set[int], sm_threshold: float) -> dict[int, float]:
+    """PIDs on `gpu_index` that are NOT ours and whose sm-util >= threshold.
+
+    Empty result == no neighbor is actively computing right now, so an
+    idle-but-resident squatter (sm 0) does not count as interference and we
+    are free to share the card."""
+    return {
+        pid: sm
+        for pid, sm in _pid_sm_on_gpu(gpu_index).items()
+        if pid not in our_pids and sm >= sm_threshold
+    }
+
+
 def _our_process_tree(root_pid: int) -> set[int]:
     """Set of PIDs in the process tree rooted at root_pid (inclusive).

@@ -311,31 +398,38 @@ def _run_subprocess_monitored(
     env: dict[str, str],
     cwd: str,
     log_path: Path,
-    gpu_uuid: str,
+    gpu_index: str,
     monitor_interval: float,
+    sm_threshold: float,
 ) -> tuple[int, bool, list[int]]:
-    """Spawn `cmd` on the assigned GPU and watch for intruders.
+    """Spawn `cmd` on the assigned GPU and watch for *active* intruders.

     Returns (returncode, interfered, intruder_pids).

-    Two-stage protection:
+    Interference == another tenant is actually computing on our card, i.e. a
+    PID that is not in our process tree has sm-utilization >= `sm_threshold`.
+    A neighbor that only parks resident VRAM at 0% sm is NOT interference — we
+    deliberately co-run with those (that is the whole point of the util gate).

-    1. **Pre-spawn check**: if the GPU has any PID right before we Popen,
-       someone won the race between pool.acquire() and now. Don't even
-       launch — return INTERFERED with the squatters' PIDs so the
-       dispatcher requeues this workload.
+    Two-stage protection, both using per-PID sm-util (`nvidia-smi pmon`):

-    2. **Per-poll ancestry check**: at every `monitor_interval`, take the
-       set of PIDs on the GPU and subtract our process tree (walked via
-       /proc PPID chain from our subprocess.pid). Anything left = intruder
-       → SIGTERM subprocess. No "grace period" guessing — ancestry is the
-       ground truth for what's ours.
+    1. **Pre-spawn check**: if any stranger is already actively computing,
+       someone grabbed the card between pool.acquire() and now (or an
+       idle-looking card just woke up). Don't launch — return INTERFERED so
+       the dispatcher requeues this workload.
+
+    2. **Per-poll check**: at every `monitor_interval`, take the per-PID sm
+       map, drop our process tree (walked via /proc PPID chain), and if any
+       remaining PID is at/above the sm threshold, SIGTERM the subprocess.
+       This catches a brand-new intruder *and* a resident neighbor that
+       bursts its own sm mid-run — per-PID sm stays meaningful even while our
+       own kernel pegs the device-level utilization.
     """
-    if gpu_uuid:
-        pre = _pids_on_gpu(gpu_uuid)
+    if gpu_index:
+        pre = _active_strangers(gpu_index, set(), sm_threshold)
         if pre:
             with open(log_path, "w") as lf:
-                lf.write(f"RACE_LOST: pre-spawn check — GPU already has PIDs {sorted(pre)}\n")
+                lf.write(f"RACE_LOST: pre-spawn check — active strangers {pre}\n")
             return -1, True, sorted(pre)

     with open(log_path, "w") as lf:
@@ -348,15 +442,12 @@ def _run_subprocess_monitored(
                 break # subprocess exited normally
             except subprocess.TimeoutExpired:
                 pass
-            if not gpu_uuid:
+            if not gpu_index:
                 continue
-            on_gpu = _pids_on_gpu(gpu_uuid)
-            if not on_gpu:
-                continue # subprocess hasn't initialized CUDA yet, nothing to compare
             ours = _our_process_tree(proc.pid)
-            strangers = on_gpu - ours
-            if strangers:
-                intruders = sorted(strangers)
+            active = _active_strangers(gpu_index, ours, sm_threshold)
+            if active:
+                intruders = sorted(active)
                 try:
                     proc.terminate()
                     proc.wait(timeout=10)
@@ -422,9 +513,12 @@ def run_one(
     interfered = False
     intruder_pids: list[int] = []
     try:
-        gpu_uuid = "" if no_monitor else (_gpu_uuid_of(gpu) or "")
+        # Pass the physical GPU index (not "" ) only when monitoring is on;
+        # the monitor uses per-PID sm-util (pmon) keyed by this index.
+        monitor_idx = "" if no_monitor else gpu
         returncode, interfered, intruder_pids = _run_subprocess_monitored(
-            cmd, env, workdir, log_path, gpu_uuid, MONITOR_INTERVAL,
+            cmd, env, workdir, log_path, monitor_idx, MONITOR_INTERVAL,
+            pool.util_threshold,
         )
         if interfered:
             record["status"] = "INTERFERED"
@@ -896,8 +990,9 @@ def main() -> None:
                     help="Regression threshold in percent slowdown")
     ap.add_argument("--filter", type=str, default=None,
                     help="Only keep workloads whose kernel contains this substring")
-    ap.add_argument("--gpus", type=str, default=None,
-                    help="Comma-separated GPU indices (default: all visible)")
+    # NOTE: there is intentionally no --gpus flag. GPU selection is automatic
+    # (util-gated probe + per-acquire utilization scan); a human pinning cards
+    # defeats that and can land work on a busy card. See acquire()/_occupied_indices.
     ap.add_argument("--label", type=str, default=None,
                     help="Free-form label for this run (default: git short sha)")
     ap.add_argument("--no-report", action="store_true",
@@ -908,6 +1003,12 @@ def main() -> None:
                     help="Per-GPU probe timeout in seconds (default 60)")
     ap.add_argument("--no-monitor", action="store_true",
                     help="Don't monitor for GPU interference during workloads")
+    ap.add_argument("--util-threshold", type=float, default=DEFAULT_UTIL_THRESHOLD,
+                    help="%% GPU/sm utilization at/above which a card counts as "
+                         "actively in use: selection skips such cards and the "
+                         "monitor requeues if a neighbor crosses it mid-run. "
+                         "Cards merely holding resident VRAM at lower util are "
+                         f"shared (default {DEFAULT_UTIL_THRESHOLD:g})")
     args = ap.parse_args()

     workloads = load_workloads(args.workloads)
@@ -942,26 +1043,35 @@ def main() -> None:
     print(f"[tir-bench] tail : tail -f {latest_log}")
     print(f"[tir-bench] run id : {stamp}")

-    allowed = set(args.gpus.split(",")) if args.gpus else None
-
-    # ── Two-step GPU selection ──
-    # 1. Startup probe: run a tiny fp16 matmul on every --gpus-filtered card
+    # ── Automatic GPU selection (no manual override on purpose) ──
+    # 1. Startup probe: run a tiny fp16 matmul on every visible card
     #    (including busy ones — the probe is light, finishes fine on a
     #    contended card; this catches broken drivers / ECC). Probe failures
     #    are banned for the rest of the run.
-    # 2. Per-workload acquire: re-scan nvidia-smi memory/util every time we
-    #    need a card, pick any probe-OK one that's currently free. A card
-    #    that was busy at sweep start is reusable the moment it frees up.
-    listing_pool = GpuPool(allowed=allowed)
-    in_filter = [idx for idx, _ in listing_pool._all_gpus() if allowed is None or idx in allowed]
+    # 2. Per-workload acquire: re-scan utilization every time we need a card
+    #    and pick any probe-OK one whose util is below --util-threshold. A
+    #    card pegged at sweep start is reusable the moment its util drops; a
+    #    card merely holding resident VRAM at low util is shared right away.
+    listing_pool = GpuPool(util_threshold=args.util_threshold)
+    in_filter = [idx for idx, _ in listing_pool._all_gpus()]
     if not in_filter:
-        print("[tir-bench] no visible GPUs match --gpus filter.", file=sys.stderr)
+        print("[tir-bench] no visible GPUs.", file=sys.stderr)
         sys.exit(1)
-    busy_now = sorted(listing_pool._busy_indices() & set(in_filter))
+    utils_now = listing_pool._utils()
+    occupied_now = sorted(listing_pool._occupied_indices() & set(in_filter), key=int)
+    resident = sorted(listing_pool._busy_indices() & set(in_filter), key=int)
+    util_str = " ".join(f"{i}:{utils_now.get(i, 0):.0f}%" for i in sorted(in_filter, key=int))
+    print(
+        f"[tir-bench] visible: {len(in_filter)} {sorted(in_filter, key=int)}; "
+        f"util now [{util_str}]",
+        flush=True,
+    )
     print(
-        f"[tir-bench] visible: {len(in_filter)} {sorted(in_filter)}; "
-        f"busy now: {busy_now if busy_now else 'none'} "
-        f"(any compute-app PID present)",
+        f"[tir-bench] gate: util-threshold={args.util_threshold:g}% — "
+        f"occupied (skip): {occupied_now if occupied_now else 'none'}; "
+        f"shareable incl. idle-but-resident: "
+        f"{sorted((set(in_filter) - set(occupied_now)), key=int)} "
+        f"(resident-VRAM cards: {resident if resident else 'none'})",
         flush=True,
     )

@@ -978,7 +1088,7 @@ def main() -> None:
             print(f"[tir-bench] gpu {idx}: {err}", file=sys.stderr)
         sys.exit(1)

-    pool = GpuPool(allowed=usable)
+    pool = GpuPool(allowed=usable, util_threshold=args.util_threshold)
     n_gpus = len(usable)

     _repo_git = collect_repo_git()
diff --git a/.claude/commands/tir-bench/workloads.yaml b/.claude/commands/tir-bench/workloads.yaml
index 0769f50e1a..c8d99584ea 100644
--- a/.claude/commands/tir-bench/workloads.yaml
+++ b/.claude/commands/tir-bench/workloads.yaml
@@ -111,12 +111,12 @@ workloads:
   - {kernel: fp16_bf16_gemm, config: fp16_2048x2048x2048}
   - {kernel: fp16_bf16_gemm, config: fp16_4096x4096x4096}
   - {kernel: fp16_bf16_gemm, config: fp16_8192x8192x8192}
-  - {kernel: fp16_bf16_gemm, config: fp16_16384x16384x16384}
+  - {kernel: fp16_bf16_gemm, config: fp16_16384x16384x16384, warmup: 10, repeat: 30} # big bandwidth-bound GEMM: short exposure beats more iters under contention
   - {kernel: fp16_bf16_gemm, config: bf16_1024x1024x1024}
   - {kernel: fp16_bf16_gemm, config: bf16_2048x2048x2048}
   - {kernel: fp16_bf16_gemm, config: bf16_4096x4096x4096}
   - {kernel: fp16_bf16_gemm, config: bf16_8192x8192x8192}
-  - {kernel: fp16_bf16_gemm, config: bf16_16384x16384x16384}
+  - {kernel: fp16_bf16_gemm, config: bf16_16384x16384x16384, warmup: 10, repeat: 30} # big bandwidth-bound GEMM: short exposure beats more iters under contention
   # ── fp8_blockwise_gemm
   - {kernel: fp8_blockwise_gemm, config: smoke_1024x1024x1024}
   - {kernel: fp8_blockwise_gemm, config: deepgemm_m4096_n2112_k7168}
@@ -132,4 +132,4 @@ workloads:
   - {kernel: nvfp4_gemm, config: 2048x2048x2048}
   - {kernel: nvfp4_gemm, config: 4096x4096x4096}
   - {kernel: nvfp4_gemm, config: 8192x8192x8192}
-  - {kernel: nvfp4_gemm, config: 16384x16384x16384}
+  - {kernel: nvfp4_gemm, config: 16384x16384x16384, warmup: 10, repeat: 30} # big bandwidth-bound GEMM: short exposure; take median-of-3 (per-run noisy, median stable)
diff --git a/.gitignore b/.gitignore
index b2b383bdd7..ca3555a73d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -291,3 +291,7 @@ python/bin/
 python/typing_extensions.py
 python/*.dist-info/
 pytest-of-bohanhou/
+
+# tir-bench run artifacts (regenerable; see .claude/commands/tir-bench.md)
+.tir-bench/
+.tir-bench-*/
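
Illustration (not part of the commit): the per-PID interference check this change describes can be tried without a GPU by running the same parsing and filtering over canned text. This is a minimal sketch, assuming the `pmon -s u` column order quoted in the commit's docstring (gpu, pid, type, sm, ...). The names `parse_pmon`, `active_strangers`, and the `SAMPLE` text are hypothetical and hand-written for this note; they are not copied from run.py or from real nvidia-smi output.

```python
def parse_pmon(text: str) -> dict[int, float]:
    """Map PID -> sm% from pmon-style text (assumed columns: gpu pid type sm ...)."""
    result: dict[int, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' header rows carry no process data
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            pid, sm = int(fields[1]), float(fields[3])
        except ValueError:
            continue  # idle rows show "-" in place of a number
        result[pid] = sm
    return result


def active_strangers(
    pid_sm: dict[int, float], our_pids: set[int], sm_threshold: float
) -> dict[int, float]:
    """PIDs that are not ours and whose sm% is at or above the threshold."""
    return {
        pid: sm
        for pid, sm in pid_sm.items()
        if pid not in our_pids and sm >= sm_threshold
    }


# Hypothetical sample: our benchmark (4242), a parked neighbor (5151),
# an actively computing neighbor (6161), and one idle placeholder row.
SAMPLE = """\
# gpu pid type sm mem enc dec jpg ofa command
0 4242 C 97 40 - - - - python
0 5151 C 0 0 - - - - parked_job
0 6161 C 35 12 - - - - neighbor
0 - - - - - - - - -
"""

pid_sm = parse_pmon(SAMPLE)
print(active_strangers(pid_sm, our_pids={4242}, sm_threshold=10.0))  # {6161: 35.0}
```

With a threshold of 10, the parked process at 0% sm is ignored and only the neighbor at 35% is reported, which is the distinction the commit message draws between idle-but-resident cards and cards in active use.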
