This is an automated email from the ASF dual-hosted git repository.

tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm.git


The following commit(s) were added to refs/heads/main by this push:
     new b8211bfd93 [Docs]Rework Bring Your Own Codegen tutorial and add 
TensorRT example (#19839)
b8211bfd93 is described below

commit b8211bfd93a7f52c27e5f9a877dad4afd015fded
Author: Shushi Hong <[email protected]>
AuthorDate: Fri Jun 19 07:15:49 2026 -0400

    [Docs]Rework Bring Your Own Codegen tutorial and add TensorRT example 
(#19839)
    
    To solve #19682 , this pr reworks BYOC tutorial into two parts driven by
    one shared model:
    
    - "How BYOC works": run a single conv2d+relu through the same
    FuseOpsByPattern -> MergeCompositeFunctions -> RunCodegen flow on both
    the example NPU (a stub, so check shape) and TensorRT (real,
    cross-checked against a CPU build), so the only thing that varies is the
    backend. partition_for_tensorrt is shown as the one-line wrapper for
    those two passes, with the bind_constants / stub-vs-real /
    shape-vs-value contrasts side by side. Add an FP16 example via the
    relax.ext.tensorrt.options pass config and a summary table; drop the
    redundant second NPU section.
    
    - "Deploying a PyTorch model with TensorRT": take a real torch.nn.Module
    through torch.export -> from_exported_program -> partition_for_tensorrt
    -> build for CUDA -> run, cross-checking the GPU output against PyTorch.
    
    This pr also fixes two stale references in the example NPU backend: the
    README and the runtime's \file docstring pointed at
    src/runtime/contrib/example_npu/ but the file lives under
    src/runtime/extra/contrib/example_npu/; and reword the README's "Memory
    constraint checking: Validates tensor sizes" bullet, since
    _check_npu_memory_constraints / _check_npu_quantization are explicit
    placeholders that return True.
    
    Validated end-to-end on a CUDA GPU with TensorRT 10: the example NPU,
    TensorRT, FP16, and PyTorch-deployment cells all run and match their
    references.
---
 cmake/config.cmake                                 |   3 +-
 docs/how_to/tutorials/bring_your_own_codegen.py    | 438 ++++++++++++++-------
 .../relax/backend/contrib/example_npu/README.md    |   4 +-
 .../contrib/example_npu/example_npu_runtime.cc     |   2 +-
 4 files changed, 303 insertions(+), 144 deletions(-)

diff --git a/cmake/config.cmake b/cmake/config.cmake
index dfbe0d2178..068b17fbf0 100644
--- a/cmake/config.cmake
+++ b/cmake/config.cmake
@@ -203,7 +203,8 @@ set(USE_CUBLAS OFF)
 set(USE_SORT ON)
 
 # Whether to build with TensorRT codegen or runtime
-# Examples are available here: docs/deploy/tensorrt.rst.
+# An end-to-end example is available here:
+# docs/how_to/tutorials/bring_your_own_codegen.py.
 #
 # USE_TENSORRT_CODEGEN - Support for compiling a graph where supported 
operators are
 #                        offloaded to TensorRT. OFF/ON
diff --git a/docs/how_to/tutorials/bring_your_own_codegen.py 
b/docs/how_to/tutorials/bring_your_own_codegen.py
index b6039e4930..a0d4534cc4 100644
--- a/docs/how_to/tutorials/bring_your_own_codegen.py
+++ b/docs/how_to/tutorials/bring_your_own_codegen.py
@@ -18,55 +18,72 @@
 """
 .. _tutorial-bring-your-own-codegen:
 
-Bring Your Own Codegen: NPU Backend Example
-===========================================
-
-This tutorial shows how to integrate a custom hardware backend with TVM's
-BYOC framework, using the bundled example NPU backend (CPU emulation, no
-real hardware required) as the worked example.  You will see the key
-concepts needed to offload operations to a custom accelerator: pattern
-registration, graph partitioning, codegen, and runtime dispatch.
-
-NPUs are purpose-built accelerators designed around a fixed set of operations
-common in neural network inference, such as matrix multiplication, convolution,
-and activation functions.
-The example backend's runtime is a *stub*: it logs the dispatch decisions an
+Bring Your Own Codegen
+======================
+
+TVM's Bring Your Own Codegen (BYOC) framework lets you offload parts of a model
+to a custom backend -- a hardware accelerator, an inference library, or your 
own
+kernels -- while TVM compiles the rest.  This tutorial has two parts:
+
+- **How BYOC works** -- we teach the flow with a bundled, hardware-free 
*example
+  NPU* backend and then drive the **same flow** on a real production backend,
+  NVIDIA TensorRT.  Both run a small, hand-written model so every step is
+  visible; the only thing that changes between them is the backend, and that
+  contrast is the lesson.
+- **Deploying a real model** -- we then put it to work, taking an actual 
PyTorch
+  ``nn.Module`` from export through TensorRT and running it on the GPU.
+
+The example NPU is a teaching stub: its runtime logs the dispatch decisions an
 NPU would make (memory tier, execution engine, fusion) but performs no real
-computation, so output buffers are uninitialized.  Assertions in this tutorial
-therefore check shapes, not values.  When you replace the runtime with your
-hardware SDK calls, the same flow produces real results.
-
-**Prerequisites**: Build TVM with ``USE_EXAMPLE_NPU_CODEGEN=ON`` and
-``USE_EXAMPLE_NPU_RUNTIME=ON``.
+computation, so its output buffers are left uninitialized.  We therefore check
+*shapes*, not values, in the NPU sections -- its job is to make every BYOC step
+visible with nothing hidden.  TensorRT then runs the identical flow for real, 
so
+we cross-check its result against a reference.
+
+**Prerequisites**: the example NPU sections need TVM built with
+``USE_EXAMPLE_NPU_CODEGEN=ON`` and ``USE_EXAMPLE_NPU_RUNTIME=ON``; the TensorRT
+sections need ``USE_TENSORRT_CODEGEN=ON``, ``USE_TENSORRT_RUNTIME=ON`` and
+``USE_CUDA=ON`` plus a CUDA GPU and a matching TensorRT install (from NVIDIA's
+``pip install tensorrt`` packages or the TensorRT archive); the final 
deployment
+section also needs PyTorch.  Each section degrades gracefully when its backend 
is
+unavailable.
 """
 
 ######################################################################
-# Overview of the BYOC Flow
+# Overview of the BYOC flow
 # -------------------------
 #
-# The BYOC framework lets you plug a custom backend into TVM's compilation
-# pipeline in four steps:
+# BYOC plugs a custom backend into TVM's compilation pipeline in four steps:
 #
-# 1. **Register patterns** - describe which sequences of Relax ops the
-#    backend can handle.
+# 1. **Register patterns** - describe which sequences of Relax ops the backend
+#    can handle.
 # 2. **Partition the graph** - group matched ops into composite functions.
-# 3. **Run codegen** - lower composite functions to backend-specific
-#    representation (JSON graph for the example NPU).
-# 4. **Execute** - the runtime dispatches composite functions to the
-#    registered backend runtime.
+# 3. **Run codegen** - lower each composite to the backend's representation
+#    (a JSON graph for both backends in this tutorial).
+# 4. **Execute** - the runtime dispatches each composite to the backend.
+#
+# Steps 1 and 2 are pure Python and run anywhere; steps 3 and 4 need the
+# backend's codegen and runtime compiled into TVM, which is why the
+# build-and-run cells below are guarded.
 
 ######################################################################
-# Step 1: Import the backend to register its patterns
-# ---------------------------------------------------
+# Step 1: Import the backends to register their patterns
+# ------------------------------------------------------
 #
-# Importing the module is enough to register all supported patterns with
-# TVM's pattern registry.
+# Importing a backend module registers its patterns with TVM's global registry.
+# Pattern registration is independent of the C++ build -- only codegen and the
+# runtime require the backend to be compiled in -- so we probe each backend and
+# guard the build-and-run cells accordingly.
+
+import os
+import tempfile
 
 import numpy as np
 
 import tvm
-import tvm.relax.backend.contrib.example_npu  # registers patterns
+import tvm.relax.backend.contrib.example_npu
 from tvm import relax
+from tvm.relax.backend.contrib.tensorrt import partition_for_tensorrt
 from tvm.relax.backend.pattern_registry import get_patterns_with_prefix
 from tvm.relax.transform import FuseOpsByPattern, MergeCompositeFunctions, 
RunCodegen
 from tvm.script import relax as R
@@ -75,148 +92,289 @@ has_example_npu_codegen = 
tvm.get_global_func("relax.ext.example_npu", True)
 has_example_npu_runtime = 
tvm.get_global_func("runtime.ExampleNPUJSONRuntimeCreate", True)
 has_example_npu = has_example_npu_codegen and has_example_npu_runtime
 
-target = tvm.target.Target("llvm")
-
-patterns = get_patterns_with_prefix("example_npu")
-print("Registered patterns:", [p.name for p in patterns])
+has_tensorrt_codegen = tvm.get_global_func("relax.ext.tensorrt", True) is not 
None
+_is_trt_runtime_enabled = 
tvm.get_global_func("relax.is_tensorrt_runtime_enabled", True)
+has_tensorrt = (
+    has_tensorrt_codegen and _is_trt_runtime_enabled is not None and 
_is_trt_runtime_enabled()
+)
+has_cuda = tvm.cuda(0).exist
 
 ######################################################################
-# Step 2: Define a model
-# ----------------------
+# Step 2: Define the model
+# ------------------------
 #
-# We use a simple MatMul + ReLU module to illustrate the flow.
+# A single convolution followed by a ReLU.  This one model is used for both
+# backends.
 
 
 @tvm.script.ir_module
-class MatmulReLU:
+class ConvReLU:
     @R.function
     def main(
-        x: R.Tensor((2, 4), "float32"),
-        w: R.Tensor((4, 8), "float32"),
-    ) -> R.Tensor((2, 8), "float32"):
+        data: R.Tensor((1, 3, 32, 32), "float32"),
+        weight: R.Tensor((16, 3, 3, 3), "float32"),
+    ) -> R.Tensor((1, 16, 30, 30), "float32"):
         with R.dataflow():
-            y = relax.op.matmul(x, w)
-            z = relax.op.nn.relu(y)
-            R.output(z)
-        return z
+            conv = relax.op.nn.conv2d(data, weight)
+            out = relax.op.nn.relu(conv)
+            R.output(out)
+        return out
 
 
 ######################################################################
-# Step 3: Partition the graph
-# ---------------------------
-#
-# ``FuseOpsByPattern`` groups ops that match a registered pattern into
-# composite functions, controlled by two flags:
-#
-# - ``bind_constants=False`` keeps weights as function arguments instead
-#   of baking them in, so the host stays in charge of parameter
-#   ownership.
-# - ``annotate_codegen=True`` tags each composite with its backend name
-#   (``example_npu``); without this tag, ``RunCodegen`` has no way to
-#   route the composite to a backend.
-#
-# ``MergeCompositeFunctions`` then consolidates adjacent composites
-# that target the same backend so each group becomes a single external
-# call.  Note that consolidation depends on the patterns themselves: an
-# ``op_a + op_b`` chain only collapses into one composite if a fused
-# pattern (e.g. ``matmul_relu_fused``) was registered for it; otherwise
-# each op stays as its own composite even when both target the same
-# backend.
-
-mod = MatmulReLU
-mod = FuseOpsByPattern(patterns, bind_constants=False, 
annotate_codegen=True)(mod)
-mod = MergeCompositeFunctions()(mod)
-print("After partitioning:")
-print(mod)
-
-######################################################################
-# Step 4: Run codegen
-# -------------------
+# Step 3: Partition for the example NPU
+# -------------------------------------
+#
+# ``FuseOpsByPattern`` groups ops matching a registered pattern into composite
+# functions; ``MergeCompositeFunctions`` then consolidates adjacent composites
+# bound for the same backend into a single external call.  Two flags steer
+# partitioning:
 #
-# ``RunCodegen`` lowers each annotated composite function to the backend's
-# serialization format.  For the example NPU this produces a JSON graph
-# that the C++ runtime can execute.
+# - ``bind_constants=False`` keeps weights as function arguments, so the host
+#   stays in charge of the parameters.  (TensorRT below makes the opposite
+#   choice: it binds weights as constants because it bakes them into its 
engine.)
+# - ``annotate_codegen=True`` wraps each matched composite in a function tagged
+#   with the backend name -- the tag ``RunCodegen`` routes on.  (The follow-up
+#   ``MergeCompositeFunctions`` also attaches this tag when it groups 
composites,
+#   which is why ``partition_for_tensorrt`` below can leave the flag off.)
 #
-# Steps 4 and 5 require TVM to be built with ``USE_EXAMPLE_NPU_CODEGEN=ON``
-# and ``USE_EXAMPLE_NPU_RUNTIME=ON``.
+# The example NPU registers a fused ``conv2d + relu`` pattern with higher
+# priority than the standalone ``conv2d`` pattern, so the two ops collapse 
into a
+# single ``example_npu.conv2d_relu_fused`` composite -- look for it in the
+# printed module.
 
-if has_example_npu:
-    mod = RunCodegen()(mod)
-    print("After codegen:")
-    print(mod)
+npu_patterns = get_patterns_with_prefix("example_npu")
+npu_mod = FuseOpsByPattern(npu_patterns, bind_constants=False, 
annotate_codegen=True)(ConvReLU)
+npu_mod = MergeCompositeFunctions()(npu_mod)
+print("After partitioning for the example NPU:")
+print(npu_mod)
 
-    ######################################################################
-    # Step 5: Build and run
-    # ---------------------
-    #
-    # Build the module for the host target, create a virtual machine, and
-    # execute the compiled function.
+######################################################################
+# Step 4: Codegen, build and run on the example NPU
+# -------------------------------------------------
+#
+# ``RunCodegen`` invokes each annotated composite's backend codegen, replacing 
it
+# with the backend runtime module (here, the NPU's JSON graph); ``relax.build``
+# then compiles the remaining host-side program and links everything.  Because
+# the runtime is a stub that computes nothing, we assert on the output *shape*
+# only -- the values are uninitialized.
 
-    np.random.seed(0)
-    x_np = np.random.randn(2, 4).astype("float32")
-    w_np = np.random.randn(4, 8).astype("float32")
+np.random.seed(0)
+data_np = np.random.randn(1, 3, 32, 32).astype("float32")
+weight_np = np.random.randn(16, 3, 3, 3).astype("float32")
 
-    with tvm.transform.PassContext(opt_level=3):
-        built = relax.build(mod, target)
+if has_example_npu:
+    npu_mod = RunCodegen()(npu_mod)
 
-    vm = relax.VirtualMachine(built, tvm.cpu())
-    result = vm["main"](tvm.runtime.tensor(x_np, tvm.cpu()), 
tvm.runtime.tensor(w_np, tvm.cpu()))
+    with tvm.transform.PassContext(opt_level=3):
+        npu_exec = relax.build(npu_mod, tvm.target.Target("llvm"))
 
-    assert result.numpy().shape == (2, 8)
-    print("Execution completed. Output shape:", result.numpy().shape)
+    npu_vm = relax.VirtualMachine(npu_exec, tvm.cpu())
+    npu_out = npu_vm["main"](
+        tvm.runtime.tensor(data_np, tvm.cpu()), tvm.runtime.tensor(weight_np, 
tvm.cpu())
+    )
+    assert npu_out.numpy().shape == (1, 16, 30, 30)
+    print("Example NPU run completed. Output shape:", npu_out.numpy().shape)
+else:
+    print("Example NPU backend unavailable; skipping its build and run.")
 
 ######################################################################
-# Step 6: Conv2D + ReLU
-# ---------------------
+# The same flow on a real backend: TensorRT
+# -----------------------------------------
+#
+# Steps 1-4 above are the whole mechanism.  Aiming them at a real backend
+# changes very little, so rather than repeat the walkthrough, here is only what
+# differs for NVIDIA TensorRT:
 #
-# The same flow applies to convolution workloads.  Because the fused
-# ``conv2d + relu`` pattern is registered after the standalone
-# ``conv2d`` pattern in ``patterns.py`` (later entries have higher
-# priority), both ops are offloaded as a single composite function.
+# - **Partition in one call.** ``partition_for_tensorrt`` bundles the
+#   ``FuseOpsByPattern`` + ``MergeCompositeFunctions`` you ran by hand, using
+#   TensorRT's own pattern table.
+# - **Weights become constants** (``bind_constants=True``): TensorRT bakes them
+#   into the engine it builds, so bind the parameters before partitioning.
+# - **Real values.** TensorRT actually computes, so we build for CUDA, run on
+#   the GPU, and cross-check against a plain CPU build -- not just the shape.
+
+trt_mod = relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
+trt_mod = partition_for_tensorrt(trt_mod)
+print("After partition_for_tensorrt:")
+print(trt_mod)
 
+######################################################################
+# Build for CUDA, run on the GPU, and compare against the CPU reference.
 
[email protected]_module
-class Conv2dReLU:
-    @R.function
-    def main(
-        x: R.Tensor((1, 3, 32, 32), "float32"),
-        w: R.Tensor((16, 3, 3, 3), "float32"),
-    ) -> R.Tensor((1, 16, 30, 30), "float32"):
-        with R.dataflow():
-            y = relax.op.nn.conv2d(x, w)
-            z = relax.op.nn.relu(y)
-            R.output(z)
-        return z
+if has_tensorrt and has_cuda:
+    dev = tvm.cuda(0)
+    with tvm.transform.PassContext(opt_level=3):
+        trt_exec = relax.build(RunCodegen()(trt_mod), "cuda")
+    trt_out = relax.VirtualMachine(trt_exec, 
dev)["main"](tvm.runtime.tensor(data_np, dev)).numpy()
 
+    cpu_mod = relax.transform.LegalizeOps()(
+        relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
+    )
+    cpu_exec = relax.build(cpu_mod, "llvm")
+    cpu_out = relax.VirtualMachine(cpu_exec, tvm.cpu())["main"](
+        tvm.runtime.tensor(data_np, tvm.cpu())
+    ).numpy()
 
-if has_example_npu:
-    mod2 = Conv2dReLU
-    mod2 = FuseOpsByPattern(patterns, bind_constants=False, 
annotate_codegen=True)(mod2)
-    mod2 = MergeCompositeFunctions()(mod2)
-    mod2 = RunCodegen()(mod2)
+    np.testing.assert_allclose(trt_out, cpu_out, rtol=1e-2, atol=1e-2)
+    print("TensorRT output shape:", trt_out.shape, "- matches the CPU 
reference.")
+else:
+    print("TensorRT/CUDA unavailable; skipping the GPU build and run.")
+
+######################################################################
+# A real backend also exposes knobs the stub does not.  Setting ``use_fp16``
+# through the ``relax.ext.tensorrt.options`` config lets TensorRT pick FP16
+# kernels, trading a little accuracy for speed; nothing else about the flow
+# changes.  (Other options are environment-driven: ``TVM_TENSORRT_USE_INT8``
+# enables INT8 with calibration, ``TVM_TENSORRT_MAX_WORKSPACE_SIZE`` caps the
+# build workspace, and ``TVM_TENSORRT_CACHE_DIR`` caches built engines to disk
+# for reuse across runs.)
+
+if has_tensorrt and has_cuda:
+    fp16_mod = partition_for_tensorrt(
+        relax.transform.BindParams("main", {"weight": weight_np})(ConvReLU)
+    )
+    with tvm.transform.PassContext(
+        opt_level=3, config={"relax.ext.tensorrt.options": {"use_fp16": True}}
+    ):
+        fp16_exec = relax.build(RunCodegen()(fp16_mod), "cuda")
+    fp16_out = relax.VirtualMachine(fp16_exec, tvm.cuda(0))["main"](
+        tvm.runtime.tensor(data_np, tvm.cuda(0))
+    ).numpy()
+
+    np.testing.assert_allclose(fp16_out, cpu_out, rtol=5e-2, atol=5e-2)
+    print("TensorRT FP16 output shape:", fp16_out.shape, "- matches within 
FP16 tolerance.")
+else:
+    print("TensorRT/CUDA unavailable; skipping the FP16 build.")
+
+######################################################################
+# Example NPU vs TensorRT at a glance
+# -----------------------------------
+#
+# The same four-step flow, two backends:
+#
+# =========  ==============================  ==================================
+# Aspect     Example NPU (teaching stub)     TensorRT (real backend)
+# =========  ==============================  ==================================
+# Runtime    logs decisions, no compute      builds and runs an nvinfer engine
+# Output     uninitialized (check shape)     real values (cross-checked vs CPU)
+# Weights    ``bind_constants=False``        ``bind_constants=True`` (baked in)
+# Partition  two passes, by hand             ``partition_for_tensorrt`` one 
call
+# =========  ==============================  ==================================
 
+######################################################################
+# Deploying a PyTorch model with TensorRT
+# ---------------------------------------
+#
+# Everything above used a hand-written ``IRModule`` so each op was visible.  In
+# practice you start from a trained model.  This final section runs the *same*
+# ``partition_for_tensorrt`` flow on a real PyTorch ``nn.Module``, end to end:
+# export it, import it into Relax with the PyTorch frontend (the weights come 
in
+# as constants -- exactly what TensorRT bakes into its engine), partition, 
build
+# for CUDA, and check the GPU result against PyTorch's own output.  Beyond the
+# frontend import, the only difference is that the imported program returns its
+# outputs as a tuple, so we index ``[0]`` for the single result tensor; the
+# partition-build-run flow is otherwise unchanged.
+#
+# This section additionally requires PyTorch.
+
+try:
+    import torch
+    from torch import nn
+
+    has_torch = True
+except ImportError:
+    has_torch = False
+
+if has_torch and has_tensorrt and has_cuda:
+    from tvm.relax.frontend.torch import from_exported_program
+
+    class SmallConvNet(nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.conv1 = nn.Conv2d(3, 8, 3)
+            self.conv2 = nn.Conv2d(8, 16, 3)
+            self.pool = nn.MaxPool2d(2)
+
+        def forward(self, x):
+            x = torch.relu(self.conv1(x))
+            x = self.pool(x)
+            x = torch.relu(self.conv2(x))
+            return x
+
+    torch_model = SmallConvNet().eval()
+    example_input = torch.randn(1, 3, 32, 32)
+    with torch.no_grad():
+        torch_ref = torch_model(example_input).numpy()
+        exported = torch.export.export(torch_model, (example_input,))
+
+    torch_mod = from_exported_program(exported)
+    torch_mod = partition_for_tensorrt(torch_mod)
+    print("After importing and partitioning the PyTorch model:")
+    print(torch_mod)
+
+    torch_dev = tvm.cuda(0)
     with tvm.transform.PassContext(opt_level=3):
-        built2 = relax.build(mod2, target)
+        torch_exec = relax.build(RunCodegen()(torch_mod), "cuda")
+    deployed = relax.VirtualMachine(torch_exec, torch_dev)["main"](
+        tvm.runtime.tensor(example_input.numpy(), torch_dev)
+    )[0].numpy()
 
-    x2_np = np.random.randn(1, 3, 32, 32).astype("float32")
-    w2_np = np.random.randn(16, 3, 3, 3).astype("float32")
+    np.testing.assert_allclose(deployed, torch_ref, rtol=1e-2, atol=1e-2)
+    print("Deployed PyTorch model on TensorRT; output", deployed.shape, 
"matches PyTorch.")
+else:
+    print("PyTorch / TensorRT / CUDA unavailable; skipping the deployment 
example.")
 
-    vm2 = relax.VirtualMachine(built2, tvm.cpu())
-    result2 = vm2["main"](
-        tvm.runtime.tensor(x2_np, tvm.cpu()), tvm.runtime.tensor(w2_np, 
tvm.cpu())
-    )
-    assert result2.numpy().shape == (1, 16, 30, 30)
-    print("Conv2dReLU output shape:", result2.numpy().shape)
+######################################################################
+# Real deployment builds once and reuses the artifact.  Export the compiled
+# module to a shared library, then load and run it later -- in a fresh process,
+# with no PyTorch and no rebuild needed.
+
+if has_torch and has_tensorrt and has_cuda:
+    with tempfile.TemporaryDirectory() as tmpdir:
+        lib_path = os.path.join(tmpdir, "deployed_trt.so")
+        torch_exec.export_library(lib_path)
+        loaded = tvm.runtime.load_module(lib_path)
+        reran = relax.VirtualMachine(loaded, torch_dev)["main"](
+            tvm.runtime.tensor(example_input.numpy(), torch_dev)
+        )[0].numpy()
+        np.testing.assert_allclose(reran, torch_ref, rtol=1e-2, atol=1e-2)
+        print("Reloaded the exported library and reran; output", reran.shape, 
"still matches.")
+else:
+    print("PyTorch / TensorRT / CUDA unavailable; skipping the export/reload 
step.")
+
+######################################################################
+# Notes for real deployments
+# --------------------------
+#
+# - **Operator coverage and fallback.** TensorRT offloads only the ops in its
+#   pattern table (see ``python/tvm/relax/backend/contrib/tensorrt.py``);
+#   anything unsupported simply stays on the host.  Print the partitioned 
module
+#   and look for the ``Codegen: "tensorrt"`` functions to see what was 
offloaded.
+# - **Dynamic shapes.** The builder sets up an optimization profile for a 
dynamic
+#   leading (batch) dimension, so the integration can serve a model exported 
with
+#   a symbolic batch size.
+# - **Engine build cost.** Building a TensorRT engine is slow the first time 
(it
+#   is not a hang).  Set ``TVM_TENSORRT_CACHE_DIR`` to cache built engines to
+#   disk and skip the rebuild on later runs.
 
 ######################################################################
 # Next steps
 # ----------
 #
-# To build a real NPU backend using this example as a starting point:
+# To build your own backend using the example NPU as a starting point:
 #
-# - Replace ``example_npu_runtime.cc`` with your hardware SDK calls.
+# - Replace the stub runtime in
+#   ``src/runtime/extra/contrib/example_npu/example_npu_runtime.cc`` with your
+#   hardware SDK calls.
 # - Extend ``patterns.py`` with the ops your hardware supports.
-# - Add a C++ codegen under ``src/relax/backend/contrib/`` if your
-#   hardware requires a non-JSON serialization format.
-# - Add your cmake module under ``cmake/modules/contrib/`` following
-#   the pattern in ``cmake/modules/contrib/ExampleNPU.cmake``.
+# - Add a C++ codegen under ``src/relax/backend/contrib/`` if your backend 
needs
+#   a non-JSON serialization format.
+# - Add a CMake module under ``cmake/modules/contrib/`` following
+#   ``ExampleNPU.cmake``.
+#
+# For a complete real-backend implementation to study, see the TensorRT
+# integration: the pattern table and ``partition_for_tensorrt`` in
+# ``python/tvm/relax/backend/contrib/tensorrt.py``, the codegen in
+# ``src/relax/backend/contrib/tensorrt/``, and the runtime in
+# ``src/runtime/extra/contrib/tensorrt/``.
diff --git a/python/tvm/relax/backend/contrib/example_npu/README.md 
b/python/tvm/relax/backend/contrib/example_npu/README.md
index 7e88a0ece0..310670fab1 100644
--- a/python/tvm/relax/backend/contrib/example_npu/README.md
+++ b/python/tvm/relax/backend/contrib/example_npu/README.md
@@ -168,7 +168,7 @@ in the TVM build:
 - `__init__.py` - Registers the backend and its BYOC entry points with TVM so 
the compiler can discover and use the example NPU.
 
 ### Runtime Implementation
-- `src/runtime/contrib/example_npu/example_npu_runtime.cc` - C++ runtime 
implementation that handles JSON-based graph execution for the NPU backend.
+- `src/runtime/extra/contrib/example_npu/example_npu_runtime.cc` - C++ runtime 
implementation that handles JSON-based graph execution for the NPU backend.
 
 ### Tests and Examples
 - `tests/python/contrib/test_example_npu.py` - Comprehensive test suite 
containing example IRModules (e.g. `MatmulReLU`, `Conv2dReLU`) and 
demonstrating the complete BYOC flow from pattern registration to runtime 
execution.
@@ -230,7 +230,7 @@ This shows the registered patterns and that matched 
subgraphs were turned into c
 - **Power management**: Support for different power modes (high_performance, 
balanced, low_power)
 
 ### Pattern Matching Features
-- **Memory constraint checking**: Validates tensor sizes against NPU memory 
limits
+- **Memory constraint hooks**: Placeholder checks where a real backend would 
reject tensors that exceed on-chip memory; the example accepts all
 - **Fusion opportunities**: Identifies conv+activation and other beneficial 
fusions
 - **Layout preferences**: NHWC channel-last layouts preferred by NPUs
 
diff --git a/src/runtime/extra/contrib/example_npu/example_npu_runtime.cc 
b/src/runtime/extra/contrib/example_npu/example_npu_runtime.cc
index 0408a3fe9a..a0f1d0970a 100644
--- a/src/runtime/extra/contrib/example_npu/example_npu_runtime.cc
+++ b/src/runtime/extra/contrib/example_npu/example_npu_runtime.cc
@@ -18,7 +18,7 @@
  */
 
 /*!
- * \file src/runtime/contrib/example_npu/example_npu_runtime.cc
+ * \file src/runtime/extra/contrib/example_npu/example_npu_runtime.cc
  * \brief Example NPU runtime demonstrating architectural concepts
  *
  * This runtime demonstrates key NPU architectural patterns:

Reply via email to