This is an automated email from the ASF dual-hosted git repository.

lmzheng pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm.git


The following commit(s) were added to refs/heads/main by this push:
     new 5d33491  [Tutorial] Autoscheduler on ARM devices (#7326)
5d33491 is described below

commit 5d3349104a1dc4b84f9a744aeee9b124df231f04
Author: Thierry Moreau <[email protected]>
AuthorDate: Sun Jan 24 23:38:17 2021 -0800

    [Tutorial] Autoscheduler on ARM devices (#7326)
    
    * arm tuning tutorial
    
    * adjustment to get RPC working
    
    * fix lint
    
    * fix target
    
    * integrate Leandros comments
    
    * dont request remote in CI
    
    * use API from auto_scheduler, not autoTVM and updated comments
    
    * make ci-runnable
    
    * fix the formatting
    
    * address Zhaos comments
    
    * full run stats
    
    * taking Zhaos comments into consideration
---
 tutorials/auto_scheduler/tune_network_arm.py | 421 +++++++++++++++++++++++++++
 1 file changed, 421 insertions(+)

diff --git a/tutorials/auto_scheduler/tune_network_arm.py b/tutorials/auto_scheduler/tune_network_arm.py
new file mode 100644
index 0000000..f821c2e
--- /dev/null
+++ b/tutorials/auto_scheduler/tune_network_arm.py
@@ -0,0 +1,421 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+Auto-scheduling a Neural Network for ARM CPU
+=============================================
+**Author**: `Thierry Moreau <https://github.com/tmoreau89>`_, `Lianmin Zheng <https://github.com/merrymercy>`_
+
+Auto-tuning for specific devices and workloads is critical for getting the
+best performance. This is a tutorial on how to tune a whole neural
+network for ARM CPU with the auto-scheduler via RPC.
+
+To auto-tune a neural network, we partition the network into small subgraphs and
+tune them independently. Each subgraph is treated as one search task.
+A task scheduler slices the time and dynamically allocates time resources to
+these tasks. The task scheduler predicts the impact of each task on the end-to-end
+execution time and prioritizes the one that can reduce the execution time the most.
+
+For each subgraph, we use the compute declaration in :code:`tvm/python/topi` to
+get the computational DAG in the tensor expression form.
+We then use the auto-scheduler to construct a search space of this DAG and search
+for good schedules (low-level optimizations).
+
+Unlike the template-based :ref:`autotvm <tutorials-autotvm-sec>`, which relies on
+manual templates to define the search space, the auto-scheduler does not require any
+schedule templates. In other words, the auto-scheduler only uses the compute declarations
+in :code:`tvm/python/topi` and does not use existing schedule templates.
+
+Note that this tutorial will not run on Windows or recent versions of macOS. To
+get it to run, you will need to wrap the body of this tutorial in a :code:`if
+__name__ == "__main__":` block.
+"""
+
+import numpy as np
+
+import tvm
+from tvm import relay, auto_scheduler
+import tvm.relay.testing
+from tvm.contrib import graph_runtime
+from tvm.contrib.utils import tempdir
+
+#################################################################
+# Define a Network
+# ----------------
+# First, we need to define the network with relay frontend API.
+# We can load some pre-defined network from :code:`tvm.relay.testing`.
+# We can also load models from MXNet, ONNX, PyTorch, and TensorFlow
+# (see :ref:`front end tutorials<tutorial-frontend>`).
+#
+# For convolutional neural networks, although auto-scheduler can work correctly
+# with any layout, we found the best performance is typically achieved with the NHWC layout.
+# We also implemented more optimizations for the NHWC layout with the auto-scheduler.
+# So it is recommended to convert your models to the NHWC layout to use the auto-scheduler.
+# You can use the :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM.
+
+
+def get_network(name, batch_size, layout="NHWC", dtype="float32"):
+    """Get the symbol definition and random weight of a network"""
+
+    # auto-scheduler prefers NHWC layout
+    if layout == "NHWC":
+        image_shape = (224, 224, 3)
+    elif layout == "NCHW":
+        image_shape = (3, 224, 224)
+    else:
+        raise ValueError("Invalid layout: " + layout)
+
+    input_shape = (batch_size,) + image_shape
+    output_shape = (batch_size, 1000)
+
+    if name.startswith("resnet-"):
+        n_layer = int(name.split("-")[1])
+        mod, params = relay.testing.resnet.get_workload(
+            num_layers=n_layer,
+            batch_size=batch_size,
+            layout=layout,
+            dtype=dtype,
+            image_shape=image_shape,
+        )
+    elif name.startswith("resnet3d-"):
+        n_layer = int(name.split("-")[1])
+        mod, params = relay.testing.resnet.get_workload(
+            num_layers=n_layer,
+            batch_size=batch_size,
+            layout=layout,
+            dtype=dtype,
+            image_shape=image_shape,
+        )
+    elif name == "mobilenet":
+        mod, params = relay.testing.mobilenet.get_workload(
+            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
+        )
+    elif name == "squeezenet_v1.1":
+        assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
+        mod, params = relay.testing.squeezenet.get_workload(
+            version="1.1",
+            batch_size=batch_size,
+            dtype=dtype,
+            image_shape=image_shape,
+        )
+    elif name == "inception_v3":
+        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
+        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
+    elif name == "mxnet":
+        # an example for mxnet model
+        from mxnet.gluon.model_zoo.vision import get_model
+
+        assert layout == "NCHW"
+
+        block = get_model("resnet50_v1", pretrained=True)
+        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
+        net = mod["main"]
+        net = relay.Function(
+            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
+        )
+        mod = tvm.IRModule.from_expr(net)
+
+    return mod, params, input_shape, output_shape
+
+
+#################################################################
+# Start RPC Tracker
+# -----------------
+# TVM uses RPC session to communicate with ARM boards.
+# During tuning, the tuner will send the generated code to the board and
+# measure the speed of code on the board.
+#
+# To scale up the tuning, TVM uses RPC Tracker to manage distributed devices.
+# The RPC Tracker is a centralized controller node. We can register all devices to
+# the tracker. For example, if we have 10 phones, we can register all of them
+# to the tracker, and run 10 measurements in parallel, accelerating the tuning process.
+#
+# To start an RPC tracker, run this command on the host machine. The tracker is
+# required during the whole tuning process, so we need to open a new terminal for
+# this command:
+#
+# .. code-block:: bash
+#
+#   python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
+#
+# The expected output is
+#
+# .. code-block:: bash
+#
+#   INFO:RPCTracker:bind to 0.0.0.0:9190
+
+#################################################################
+# Register Devices to RPC Tracker
+# -----------------------------------
+# Now we can register our devices to the tracker. The first step is to
+# build the TVM runtime for the ARM devices.
+#
+# * For Linux:
+#   Follow this section :ref:`build-tvm-runtime-on-device` to build
+#   the TVM runtime on the device. Then register the device to tracker by
+#
+#   .. code-block:: bash
+#
+#     python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rasp4b-64
+#
+#   (replace :code:`[HOST_IP]` with the IP address of your host machine)
+#
+# * For Android:
+#   Follow this `readme page <https://github.com/apache/tvm/tree/main/apps/android_rpc>`_ to
+#   install the TVM RPC APK on the Android device. Make sure you can pass the Android RPC test.
+#   Then you have already registered your device. During tuning, you have to go to developer options
+#   and enable "Keep screen awake during charging" and charge your phone to make it stable.
+#
+# After registering devices, we can confirm it by querying rpc_tracker
+#
+# .. code-block:: bash
+#
+#   python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
+#
+# For example, if we have 2 Huawei mate10 pro, 11 Raspberry Pi 4B with 64bit OS, and 2 rk3399,
+# the output can be
+#
+# .. code-block:: bash
+#
+#    Queue Status
+#    ----------------------------------
+#    key          total  free  pending
+#    ----------------------------------
+#    mate10pro    2      2     0
+#    rk3399       2      2     0
+#    rasp4b-64    11     11    0
+#    ----------------------------------
+#
+# You can register multiple devices to the tracker to accelerate the measurement in tuning.
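
As a rough sketch of how the tracker summary above could be checked from a script, the snippet below parses text shaped like that sample table into a dictionary. The helper name `parse_queue_status` and the assumption that every device row has exactly four whitespace-separated columns are illustrative only; they are not part of TVM's API, and the real tracker output may differ.

```python
def parse_queue_status(text):
    """Parse a tracker summary table into {key: {"total", "free", "pending"}}.

    Hypothetical helper: assumes each device row reads 'key total free pending'.
    """
    devices = {}
    for line in text.splitlines():
        parts = line.split()
        # Device rows have four columns, the last three being integers.
        # This skips the title, the header row, and the dashed rules.
        if len(parts) == 4 and all(p.isdigit() for p in parts[1:]):
            key, total, free, pending = parts
            devices[key] = {
                "total": int(total),
                "free": int(free),
                "pending": int(pending),
            }
    return devices


# Sample text copied from the example output above
sample = """
Queue Status
----------------------------------
key          total  free  pending
----------------------------------
mate10pro    2      2     0
rk3399       2      2     0
rasp4b-64    11     11    0
----------------------------------
"""

status = parse_queue_status(sample)
for key, counts in status.items():
    print("%s: %d of %d free" % (key, counts["free"], counts["total"]))
```

A check like this can confirm that the device key you plan to tune on has at least one free device before a long tuning run starts.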
+
+###########################################
+# Set Tuning Options
+# ------------------
+# Before tuning, we should apply some configurations. Here we use a Raspberry Pi 4B (4GB) board
+# running a 64-bit OS (Ubuntu 20.04) as an example. In your setting, you should modify the target
+# and device_key accordingly.
+# Set :code:`use_ndk` to True if you use an Android phone.
+
+#### DEVICE CONFIG ####
+
+# Replace "aarch64-linux-gnu" with the correct target of your board.
+# This target is used for cross compilation. You can query it by :code:`gcc -v` on your device.
+# FIXME(tmoreau89, merrymercy): We leave '-device=arm_cpu' out of the target string
+#                               because we're sharing x86 op strategy.
+target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+neon")
+
+# Also replace this with the device key in your tracker
+device_key = "rasp4b-64"
+
+# Set this to True if you use ndk tools for cross compiling
+# And also set the environment variable below to point to the cross compiler
+# (uncommenting that line also requires `import os`)
+use_ndk = False
+# os.environ["TVM_NDK_CC"] = "/usr/bin/aarch64-linux-gnu-g++"
+
+#### TUNING OPTION ####
+network = "mobilenet"
+batch_size = 1
+layout = "NHWC"
+dtype = "float32"
+log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)
+
+#################################################################
+# Extract Search Tasks
+# --------------------
+# Next, we extract the search tasks and their weights from a network.
+# The weight of a task is the number of appearances of the task's subgraph
+# in the whole network.
+# By using the weight, we can approximate the end-to-end latency of the network
+# as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
+# latency of a task and :code:`weight[t]` is the weight of the task.
+# The task scheduler will just optimize this objective.
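
To make the objective above concrete, here is a minimal sketch of the weighted sum `sum(latency[t] * weight[t])`. The function name `estimated_latency` and the three-task numbers are made up for illustration; they do not come from a real tuning run or from TVM's API.

```python
def estimated_latency(latencies_ms, weights):
    """Return sum(latency[t] * weight[t]) over all tasks t."""
    if len(latencies_ms) != len(weights):
        raise ValueError("latencies_ms and weights must have the same length")
    return sum(lat * w for lat, w in zip(latencies_ms, weights))


# Hypothetical per-task latencies (ms) and how often each subgraph appears
latencies_ms = [0.5, 2.0, 1.0]
weights = [1, 3, 2]

total = estimated_latency(latencies_ms, weights)
print("Estimated end-to-end latency: %.3f ms" % total)
```

In this example the second task contributes 6.0 ms of the 8.5 ms total because its subgraph appears three times, which is why a heavily weighted task is worth more tuning time.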
+
+# Extract tasks from the network
+print("Extract tasks...")
+mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
+tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
+
+for idx, task in enumerate(tasks):
+    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
+    print(task.compute_dag)
+
+
+#################################################################
+# Tuning and Evaluation
+# ---------------------
+# Now, we set some options for tuning and launch the search tasks
+#
+# * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
+#   You can set it to a small number (e.g., 200) for a fast demonstrative run.
+#   In practice, we recommend setting it around :code:`800 * len(tasks)`,
+#   which is typically enough for the search to converge.
+#   For example, there are 29 tasks in resnet-50, so we can set it as 20000.
+#   You can adjust this parameter according to your time budget.
+# * In addition, we use :code:`RecordToFile` to dump measurement records into a log file.
+#   The measurement records can be used to query the history best, resume the search,
+#   and do more analyses later.
+# * See :any:`auto_scheduler.TuningOptions` and
+#   :any:`auto_scheduler.RPCRunner` for more parameters.
+#
+# After auto-tuning, we can compile the network with the best schedules we found.
+# All measurement records are dumped into the log file during auto-tuning,
+# so we can read the log file and load the best schedules.
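
The rule of thumb for `num_measure_trials` can be written as a one-line calculation. This is only a sketch: the helper name `recommended_trials` is hypothetical, and the default of 800 trials per task is simply the figure quoted in the text above.

```python
def recommended_trials(num_tasks, trials_per_task=800):
    """Rough trial budget following the '800 * len(tasks)' guideline."""
    return trials_per_task * num_tasks


# resnet-50 is described above as having 29 tasks
budget = recommended_trials(29)
print("Suggested num_measure_trials:", budget)
```

For 29 tasks this gives 23200, which is in the same range as the 20000 suggested in the text; either value is a starting point to adjust against your time budget.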
+
+
+def tune_and_evaluate():
+    print("Begin tuning...")
+    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
+    tune_option = auto_scheduler.TuningOptions(
+        num_measure_trials=200,  # change this to 20000 to achieve the best performance
+        runner=auto_scheduler.RPCRunner(
+            device_key,
+            host="0.0.0.0",
+            port=9190,
+            timeout=30,
+            repeat=1,
+            min_repeat_ms=200,
+            enable_cpu_cache_flush=True,
+        ),
+        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+    )
+
+    tuner.tune(tune_option)
+
+    # Compile with the history best
+    print("Compile...")
+    with auto_scheduler.ApplyHistoryBest(log_file):
+        with tvm.transform.PassContext(
+            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
+        ):
+            lib = relay.build(mod, target=target, params=params)
+
+    # Export library
+    tmp = tempdir()
+    if use_ndk:
+        from tvm.contrib import ndk
+
+        filename = "net.so"
+        lib.export_library(tmp.relpath(filename), ndk.create_shared)
+    else:
+        filename = "net.tar"
+        lib.export_library(tmp.relpath(filename))
+
+    # Upload module to device
+    print("Upload...")
+    remote = auto_scheduler.utils.request_remote(device_key, "0.0.0.0", 9190, timeout=10000)
+    remote.upload(tmp.relpath(filename))
+    rlib = remote.load_module(filename)
+
+    # Create graph runtime
+    ctx = remote.cpu()
+    module = graph_runtime.GraphModule(rlib["default"](ctx))
+    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
+    module.set_input("data", data_tvm)
+
+    # Evaluate
+    print("Evaluate inference time cost...")
+    ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
+    prof_res = np.array(ftimer().results) * 1e3  # convert to milliseconds
+    print(
+        "Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res))
+    )
+
+
+# We do not run the tuning on our webpage server since the server has neither a Raspberry Pi
+# nor a device tracker running.
+# Uncomment the following line to run it by yourself.
+
+# tune_and_evaluate()
+
+
+######################################################################
+# .. note:: Explaining the printed information during tuning
+#
+#   During the tuning, a lot of information will be printed on the console.
+#   They are used for debugging purposes. The most important info is the output
+#   of the task scheduler. The following table is a sample output.
+#
+#   .. code-block:: c
+#
+#    ----------------------------------------------------------------------
+#    ------------------------------  [ Task Scheduler ]
+#    ----------------------------------------------------------------------
+#    |  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
+#    -------------------------------------------------
+#    |    0 |        0.013 |           0.31 |     64 |
+#    |    1 |        0.845 |           2.43 |    448 |
+#    |    2 |        0.046 |          -0.00 |     64 |
+#    |    3 |        4.194 |          24.53 |   2112 |
+#    |    4 |        0.109 |           9.21 |     64 |
+#    |    5 |        1.759 |          29.27 |    896 |
+#    |    6 |        0.083 |           6.01 |     64 |
+#    |    7 |        3.084 |          33.38 |   7680 |
+#    |    8 |        0.136 |          14.78 |    384 |
+#    |    9 |        1.349 |          38.23 |    768 |
+#    |   10 |        0.133 |           7.55 |    128 |
+#    |   11 |        2.747 |          37.56 |   1536 |
+#    |   12 |        0.338 |          11.87 |    192 |
+#    |   13 |        1.295 |          40.00 |    704 |
+#    |   14 |        0.482 |           4.16 |    256 |
+#    |   15 |        2.686 |          38.56 |   1344 |
+#    |   16 |        0.884 |           9.08 |    448 |
+#    |   17 |        1.332 |          39.18 |    704 |
+#    |   18 |        1.045 |           3.84 |    576 |
+#    |   19 |        1.391 |          38.09 |    704 |
+#    |   20 |        0.777 |          10.34 |    448 |
+#    |   21 |        0.739 |          30.97 |    448 |
+#    -------------------------------------------------
+#     Estimated total latency: 38.347 ms      Trials: 19992   Used time : 19260 s     Next ID: 3
+#
+#   This table lists the latency and (estimated) speed of all tasks.
+#   It also lists the allocation of measurement trials for all tasks.
+#   The last line prints the total weighted latency of these tasks,
+#   which can be a rough estimation of the end-to-end execution time
+#   of the network.
+#   The last line also prints the total number of measurement trials,
+#   total time spent on auto-tuning and the id of the next task to tune.
+#
+#   There will also be some "dmlc::Error" messages, because the
+#   auto-scheduler will try some invalid schedules.
+#   You can safely ignore them if the tuning can continue, because these
+#   errors are isolated from the main process.
+#
+
+######################################################################
+# .. note:: Terminate the tuning earlier
+#
+#   You can terminate the tuning earlier by forcibly killing this process.
+#   As long as you get at least one valid schedule for each task in the log file,
+#   you should be able to do the compilation (the section below).
+#
+
+#################################################################
+# Other Tips
+# ----------
+# 1. During the tuning, the auto-scheduler needs to compile many programs and
+#    extract features from them. This part is CPU-intensive,
+#    so a high-performance CPU with many cores is recommended for faster search.
+# 2. You can use :code:`python3 -m tvm.auto_scheduler.measure_record --mode distill --i log.json`
+#    to distill the large log file and keep only the most useful records.
+# 3. You can resume a search from the previous log file. You just need to
+#    add a new argument :code:`load_log_file` when creating the task scheduler
+#    in function :code:`tune_and_evaluate`. Say,
+#    :code:`tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)`
+# 4. If you have multiple target CPUs, you can use all of them to
+#    parallelize the measurements. Check this :ref:`section <tutorials-autotvm-rpc-tracker>`
+#    to learn how to use the RPC Tracker and RPC Server.
+#    To use the RPC Tracker in auto-scheduler, set the runner in :code:`TuningOptions`
+#    to :any:`auto_scheduler.RPCRunner`, as :code:`tune_and_evaluate` does above.
