This is an automated email from the ASF dual-hosted git repository.

spectrometerHBH pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm.git


The following commit(s) were added to refs/heads/main by this push:
     new 46a9a00c8e [DOCS][TIRX] Add in-kernel profiling (CudaProfiler) 
tutorial (#19895)
46a9a00c8e is described below

commit 46a9a00c8e2ba2c69c539a46412bea0f7f5b9ac7
Author: Bohan Hou <[email protected]>
AuthorDate: Fri Jun 26 20:19:46 2026 -0700

    [DOCS][TIRX] Add in-kernel profiling (CudaProfiler) tutorial (#19895)
    
    This adds an in-kernel profiling page to the TIRx native-basics CUDA
    section,
    documenting the existing `tvm.tirx.bench.CudaProfiler`.
    
    The page covers:
    
    - a minimal load / compute / store example using `start` / `end` /
    `finalize`
      markers and a user-supplied `uint64` buffer;
    - decoding the record buffer on the host and exporting a Perfetto trace
    via
      `export_to_perfetto_trace`;
    - the record/tag encoding and the device code each call lowers to (a
      `%globaltimer` read, a leader-only global store, and a block fence);
    - usage notes: one leader per `(block, group)`, buffer sizing, the
    32-bit
      `%globaltimer` wrap, and the per-region cost.
    
    The example is tested end-to-end on a CUDA GPU (B200, sm_100). It is
    wired into
    the `native_basics.rst` toctree after "Compiling and inspecting". The
    FlashAttention-4 timeline screenshot is served from `tlc-pack/web-data`
    (`images/tirx/tirx_cudaprofiler_fa4.png`), matching the other tirx doc
    figures.
---
 docs/tirx/native_basics.rst                |   1 +
 docs/tirx/native_basics/cuda/profiling.rst | 310 +++++++++++++++++++++++++++++
 2 files changed, 311 insertions(+)

diff --git a/docs/tirx/native_basics.rst b/docs/tirx/native_basics.rst
index 6d37388037..62a7623ee3 100644
--- a/docs/tirx/native_basics.rst
+++ b/docs/tirx/native_basics.rst
@@ -67,3 +67,4 @@ kernels reference compile-time parameters inside type 
annotations (see
    native_basics/cuda/control_flow
    native_basics/cuda/threads_sync
    native_basics/cuda/compiling
+   native_basics/cuda/profiling
diff --git a/docs/tirx/native_basics/cuda/profiling.rst 
b/docs/tirx/native_basics/cuda/profiling.rst
new file mode 100644
index 0000000000..81d4cddca7
--- /dev/null
+++ b/docs/tirx/native_basics/cuda/profiling.rst
@@ -0,0 +1,310 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+In-kernel profiling with CudaProfiler
+=====================================
+
+Once a kernel is correct and you have seen how it compiles (see
+:doc:`compiling`), the next question is usually *where the cycles go*. 
Host-side
+timers and ``nsys`` tell you how long a launch took, but not how that time 
splits
+across the regions *inside* one kernel — the TMA loads, the mainloop MMAs, the
+softmax, the epilogue.
+
+``tvm.tirx.bench.CudaProfiler`` is a lightweight, in-kernel event tracer for
+exactly this. You bracket regions of device code with ``start`` / ``end``
+markers; at runtime one leader thread per block stamps the GPU global timer 
into
+a buffer you pass in as an ordinary kernel argument. After the launch you read
+the buffer back and decode it into per-region durations or a Perfetto timeline.
+
+It is *not* zero cost — every event is a ``%globaltimer`` read plus a global
+store, and every thread in the region pays a block fence — so it is a
+profiling/debugging tool, not something you leave on in production.
+
+The kernel
+----------
+
+The kernel below brackets a ``load`` / ``compute`` / ``store`` sequence. The
+``compute`` region runs a 4000-iteration FMA loop so it clearly dominates. 
Events
+are a plain ``enum.Enum`` whose integer values start at 0 and index a names 
list.
+
+.. code-block:: python
+
+    from enum import Enum
+    import numpy as np
+    import tvm
+    from tvm.script import tirx as T
+    from tvm.tirx.bench import CudaProfiler, export_to_perfetto_trace
+
+    NUM_BLOCKS, BLOCK, NUM_GROUPS = 4, 128, 1
+    WRITE_STRIDE = NUM_BLOCKS * NUM_GROUPS   # >= number of (block, group) 
lanes
+    PROF_SIZE = 4096                         # uint64 slots in the profiler 
buffer
+    N = NUM_BLOCKS * BLOCK
+
+    class Ev(Enum):
+        Load = 0
+        Compute = 1
+        Store = 2
+
+    EV_NAMES = ["load", "compute", "store"]
+
+    @T.prim_func
+    def profiled_kernel(out_ptr: T.handle, inp_ptr: T.handle, prof_ptr: 
T.handle):
+        out = T.match_buffer(out_ptr, (N,), "float32")
+        inp = T.match_buffer(inp_ptr, (N,), "float32")
+        prof = T.match_buffer(prof_ptr, (PROF_SIZE,), "uint64")
+        T.device_entry()
+        bid = T.cta_id([NUM_BLOCKS])
+        tid = T.thread_id([BLOCK])
+        idx = bid * BLOCK + tid
+
+        # Construct the profiler inside the kernel; only the leader thread 
writes.
+        p = CudaProfiler(prof, write_stride=WRITE_STRIDE, 
num_groups=NUM_GROUPS,
+                         default_leader=(tid == 0))
+        p.init(0)                  # group_id = 0; also stamps the buffer 
header at slot 0
+
+        p.start(Ev.Load)
+        x: T.f32 = inp[idx]
+        p.end(Ev.Load)
+
+        p.start(Ev.Compute)
+        acc: T.f32 = T.float32(0)
+        for _ in range(4000):
+            acc = acc * T.float32(1.0001) + x
+        p.end(Ev.Compute)
+
+        p.start(Ev.Store)
+        out[idx] = acc
+        p.end(Ev.Store)
+
+        p.finalize()               # mark this (block, group) lane done
+
+Run it and read the trace
+-------------------------
+
+Allocate a zeroed ``uint64`` buffer, pass it as the last argument, then read it
+back. Each record is one ``uint64``: the high 32 bits are the timestamp, the 
low
+32 bits a packed tag, so decoding is plain bit-twiddling on the host.
+
+.. code-block:: python
+
+    dev = tvm.cuda(0)
+    exe = tvm.compile(tvm.IRModule({"main": profiled_kernel}),
+                      target=tvm.target.Target("cuda"), tir_pipeline="tirx")
+
+    inp = tvm.runtime.tensor(np.ones(N, "float32"), device=dev)
+    out = tvm.runtime.tensor(np.zeros(N, "float32"), device=dev)
+    prof = tvm.runtime.tensor(np.zeros(PROF_SIZE, "uint64"), device=dev)
+
+    exe(out, inp, prof)
+    dev.sync()
+
+    prof_np = prof.numpy()
+    opens, spans = {}, {}
+    for i in range(1, len(prof_np)):
+        word = int(prof_np[i])
+        if word == 0:
+            continue
+        ts, tag = word >> 32, word & 0xFFFFFFFF
+        block = (tag >> 12) // NUM_GROUPS
+        event_idx, event_type = (tag >> 2) & 0x3FF, tag & 0x3   # 0=start 
1=end 2=instant 3=finalize
+        if event_type == 0:
+            opens[(block, event_idx)] = ts
+        elif event_type == 1:
+            spans.setdefault(block, []).append((EV_NAMES[event_idx], ts - 
opens[(block, event_idx)]))
+    for block in sorted(spans):
+        print(f"block {block}:", ", ".join(f"{n}={d}ns" for n, d in 
spans[block]))
+
+    export_to_perfetto_trace(prof_np, "cudaprofiler.perfetto-trace", EV_NAMES)
+
+Durations are stable to within a few percent (they shift with GPU clocks)::
+
+    block 0: load=32ns, compute=8704ns, store=64ns
+    block 1: load=96ns, compute=8704ns, store=64ns
+    block 2: load=96ns, compute=8704ns, store=64ns
+    block 3: load=96ns, compute=8704ns, store=64ns
+
+``export_to_perfetto_trace`` writes ``cudaprofiler.perfetto-trace`` from the 
same
+records; drop it onto https://ui.perfetto.dev for an interactive timeline. 
Because
+the timestamps come from the global ``%globaltimer`` (not a per-SM cycle 
counter),
+events from different blocks share one time axis and are directly comparable.
+
+On a real kernel
+----------------
+
+The same markers, sprinkled through a warp-specialized FlashAttention-4 kernel
+(one ``group`` per warp-group via ``num_groups``), produce a per-warp-group
+timeline of the whole pipeline:
+
+.. figure:: 
https://raw.githubusercontent.com/tlc-pack/web-data/main/images/tirx/tirx_cudaprofiler_fa4.png
+   :align: center
+   :alt: FlashAttention-4 in-kernel timeline in Perfetto
+
+   One CTA of an FA4 forward kernel. ``group_0`` issues the TMA loads
+   (``issue-tma-*``), ``group_3`` / ``group_4`` run the softmax pipeline
+   (``softmax-max`` / ``-exp2`` / ``-sum``), and ``group_5`` runs the
+   ``correction`` — the overlap between the producer and consumer warp-groups 
is
+   exactly what intra-kernel profiling is for.
+
+The API
+-------
+
+Construct the profiler **inside** the kernel body and call four methods:
+
+* ``init(group_id)`` — once per thread; ``group_id`` selects the sub-track and
+  stamps the buffer header at slot 0.
+* ``start(event_type, leader=None)`` / ``end(event_type, leader=None)`` — open 
and
+  close a region. Every thread executes them, but only the leader stores a 
record.
+* ``finalize(leader=None)`` — write a terminal record for this lane.
+
+Constructor arguments:
+
+* ``profiler_buffer`` — the ``uint64`` buffer you pass into the kernel.
+* ``write_stride`` — how far each leader advances between writes. Must be 
``>=``
+  the number of ``(block, group)`` lanes so per-lane streams never collide;
+  ``NUM_BLOCKS * NUM_GROUPS`` is the tight value, a persistent-grid kernel uses
+  ``num_sms * num_groups``.
+* ``num_groups`` — independent sub-tracks per block. Use ``1`` for a plain 
kernel;
+  in a warp-specialized kernel give each warp-group its own ``group_id`` and
+  leader so their timelines don't mix.
+* ``default_leader`` — the predicate for the one writing thread (override per 
call
+  with ``leader=``).
+* ``profiler_enabled`` — pass ``False`` (or a false-y ``PrimExpr``) to turn 
every
+  method into a no-op, so you can leave the markers in and compile them out.
+
+``CudaProfiler`` emits ``start`` / ``end`` / ``finalize``; ``instant`` (event 
type
+2) is reserved in the wire format and understood by the decoder, but there is 
no
+method that produces one.
+
+Groups and granularity
+----------------------
+
+A block's threads are partitioned into ``num_groups`` logical *groups*, and the
+trace's unit is one ``(block, group)`` lane — each becomes its own track. The
+partition is yours: a group can be a warp-group, a single warp, or any set of
+threads, and it does **not** have to align to a warp (the recording path has no
+warp-collective op — just a predicated per-thread store and a block fence). Two
+rules:
+
+* a thread joins a group by calling ``init(group_id)``, which points *its* 
write
+  cursor at that group's lane;
+* exactly one thread per group is the leader and actually writes — pick it 
with a
+  predicate that is true for one thread in the group, and it must be a thread 
that
+  called ``init`` for that group.
+
+Because each leader has its own cursor, one ``start`` / ``end`` statement 
records
+into *every* group at once: each leader stamps its own lane.
+
+**Groups as warp-groups.** A 256-thread block is two warp-groups; give each its
+own ``group_id`` and make its first thread the leader. Here the two 
warp-groups do
+different amounts of compute, so their tracks have different durations:
+
+.. code-block:: python
+
+    NUM_GROUPS = 2
+    p = CudaProfiler(prof, write_stride=NUM_BLOCKS * NUM_GROUPS, 
num_groups=NUM_GROUPS,
+                     default_leader=(tid % 128 == 0))   # first thread of each 
warp-group
+    if tid < 128:
+        p.init(0)
+    else:
+        p.init(1)
+    # ... load ...
+    p.start(Ev.Compute)
+    if tid < 128:
+        for _ in range(1000):           # warp-group 0: light
+            acc = acc * T.float32(1.0001) + x
+    else:
+        for _ in range(5000):           # warp-group 1: heavy
+            acc = acc * T.float32(1.0001) + x
+    p.end(Ev.Compute)
+
+::
+
+    block 0 group 0: load=96ns, compute=3040ns,  store=64ns
+    block 0 group 1: load=96ns, compute=10816ns, store=64ns
+    block 1 group 0: load=96ns, compute=3072ns,  store=64ns
+    block 1 group 1: load=128ns, compute=10784ns, store=64ns
+
+**Groups that are not warp multiples.** A 128-thread block split 48 / 48 / 32
+works the same way — the leaders are the base thread of each group, and the
+48-thread groups (1.5 warps, crossing warp boundaries) each record a correct
+track:
+
+.. code-block:: python
+
+    NUM_GROUPS = 3                                  # groups [0, 48) [48, 96) 
[96, 128)
+    p = CudaProfiler(prof, write_stride=NUM_BLOCKS * NUM_GROUPS, 
num_groups=NUM_GROUPS,
+                     default_leader=((tid == 0) | (tid == 48) | (tid == 96)))
+    if tid < 48:
+        p.init(0)
+    elif tid < 96:
+        p.init(1)
+    else:
+        p.init(2)
+
+::
+
+    block 0 group 0: load=96ns, compute=4544ns, store=64ns   # 48 threads (1.5 
warps)
+    block 0 group 1: load=64ns, compute=4512ns, store=96ns   # 48 threads, 
crosses warp lines
+    block 0 group 2: load=64ns, compute=4576ns, store=64ns   # 32 threads
+
+What each call wraps
+--------------------
+
+The methods are thin wrappers around the ``T.cuda.timer_*`` intrinsics, which
+lower to small ``__device__`` helpers emitted into the generated CUDA. The
+profiler keeps two per-thread ``"local"`` scratch slots — the running tag and
+write cursor — and every record is written by:
+
+.. code-block:: c++
+
+    // tvm_builtin_get_timestamp() == asm("mov.u32 %0, %globaltimer_lo;")
+    profiler_buffer[profiler_write_offset[0]] =
+        ((uint64_t)tvm_builtin_get_timestamp() << 32) | (profiler_tag[0] | 
event_bits);
+    profiler_write_offset[0] += profiler_write_stride;   // global store; only 
the leader runs this
+
+``init`` computes ``BLOCK_GROUP_IDX = block_idx * num_groups + group_id``, 
writes
+the header ``profiler_buffer[0] = ((uint64_t)num_groups << 32) | num_blocks`` 
from
+block 0 / ``threadIdx.x == 0``, and seeds this lane's cursor to ``1 +
+BLOCK_GROUP_IDX`` and tag to ``BLOCK_GROUP_IDX << 12``. ``start`` writes the 
record
+(``event_bits = (event << 2) | 0``) then ``__threadfence_block()``; ``end`` 
fences
+then writes (``| 1``); ``finalize`` fences then writes ``0x3``. The fence runs 
on
+*every* thread in the region, only the store is leader-only — that fence is 
what
+brackets the region's memory traffic, and why the markers perturb the kernel.
+
+Usage notes and caveats
+-----------------------
+
+* **Zero the buffer before the launch.** The decoder treats ``0`` as "empty" 
and
+  reads the grid shape from slot 0, which only block 0 / thread 0 writes.
+* **Exactly one leader per (block, group).** Each thread keeps its own cursor,
+  initialized to ``1 + block_group``; two leaders in the same lane write the 
same
+  offsets and clobber each other. Use ``tid == 0`` or lane 0 of the group's 
leader
+  warp.
+* **Call ``init`` once, before any ``start``.** It seeds each thread's tag and
+  cursor; without it both are garbage.
+* **Size ``write_stride`` and the buffer together.** The largest slot a lane
+  touches is ``1 + block_group + (records_per_lane - 1) * write_stride``;
+  over-allocate, unused slots stay ``0`` and are skipped.
+* **``%globaltimer_lo`` is only the low 32 bits of the nanosecond timer.** It 
wraps
+  about every 4.29 s (``2**32`` ns), so a region straddling a wrap decodes to a
+  bogus duration. Resolution is coarse (tens of ns), so very short regions 
read 0
+  or a single tick.
+* **No payload.** ``start`` / ``end`` record only a timestamp and the event id;
+  encode anything extra in the event id (a distinct ``Ev`` member) or in
+  ``num_groups``.
+* **It is not free.** Two stores plus two block fences per region. Profile, 
read
+  the numbers, then build with ``profiler_enabled=False``.

Reply via email to