This is an automated email from the ASF dual-hosted git repository.
lmzheng pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm.git
The following commit(s) were added to refs/heads/main by this push:
new 5d33491 [Tutorial] Autoscheduler on ARM devices (#7326)
5d33491 is described below
commit 5d3349104a1dc4b84f9a744aeee9b124df231f04
Author: Thierry Moreau <[email protected]>
AuthorDate: Sun Jan 24 23:38:17 2021 -0800
[Tutorial] Autoscheduler on ARM devices (#7326)
* arm tuning tutorial
* adjustment to get RPC working
* fix lint
* fix target
* integrate Leandros comments
* dont request remote in CI
* use API from auto_scheduler, not autoTVM and updated comments
* make ci-runnable
* fix the formatting
* address Zhaos comments
* full run stats
* taking Zhaos comments into consideration
---
tutorials/auto_scheduler/tune_network_arm.py | 421 +++++++++++++++++++++++++++
1 file changed, 421 insertions(+)
diff --git a/tutorials/auto_scheduler/tune_network_arm.py
b/tutorials/auto_scheduler/tune_network_arm.py
new file mode 100644
index 0000000..f821c2e
--- /dev/null
+++ b/tutorials/auto_scheduler/tune_network_arm.py
@@ -0,0 +1,421 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+Auto-scheduling a Neural Network for ARM CPU
+=============================================
+**Author**: `Thierry Moreau <https://github.com/tmoreau89>`_, `Lianmin Zheng <https://github.com/merrymercy>`_
+
+Auto-tuning for specific devices and workloads is critical for getting the
+best performance. This is a tutorial on how to tune a whole neural
+network for an ARM CPU with the auto-scheduler via RPC.
+
+To auto-tune a neural network, we partition the network into small subgraphs and
+tune them independently. Each subgraph is treated as one search task.
+A task scheduler slices the time and dynamically allocates time resources to
+these tasks. The task scheduler predicts the impact of each task on the end-to-end
+execution time and prioritizes the one that can reduce the execution time the most.
+
+For each subgraph, we use the compute declaration in :code:`tvm/python/topi` to
+get the computational DAG in the tensor expression form.
+We then use the auto-scheduler to construct a search space of this DAG and search
+for good schedules (low-level optimizations).
+
+Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>`, which relies on
+manual templates to define the search space, the auto-scheduler does not require any
+schedule templates. In other words, the auto-scheduler only uses the compute declarations
+in :code:`tvm/python/topi` and does not use existing schedule templates.
+
+Note that this tutorial will not run on Windows or recent versions of macOS. To
+get it to run, you will need to wrap the body of this tutorial in an :code:`if
+__name__ == "__main__":` block.
+"""
+
+import numpy as np
+
+import tvm
+from tvm import relay, auto_scheduler
+import tvm.relay.testing
+from tvm.contrib import graph_runtime
+from tvm.contrib.utils import tempdir
+
+#################################################################
+# Define a Network
+# ----------------
+# First, we need to define the network with the Relay frontend API.
+# We can load some pre-defined networks from :code:`tvm.relay.testing`.
+# We can also load models from MXNet, ONNX, PyTorch, and TensorFlow
+# (see :ref:`front end tutorials<tutorial-frontend>`).
+#
+# For convolutional neural networks, although the auto-scheduler can work correctly
+# with any layout, we found the best performance is typically achieved with NHWC layout.
+# We also implemented more optimizations for NHWC layout with the auto-scheduler.
+# So it is recommended to convert your models to NHWC layout to use the auto-scheduler.
+# You can use the :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM.
+
+
+def get_network(name, batch_size, layout="NHWC", dtype="float32"):
+ """Get the symbol definition and random weight of a network"""
+
+ # auto-scheduler prefers NHWC layout
+ if layout == "NHWC":
+ image_shape = (224, 224, 3)
+ elif layout == "NCHW":
+ image_shape = (3, 224, 224)
+ else:
+ raise ValueError("Invalid layout: " + layout)
+
+ input_shape = (batch_size,) + image_shape
+ output_shape = (batch_size, 1000)
+
+ if name.startswith("resnet-"):
+ n_layer = int(name.split("-")[1])
+ mod, params = relay.testing.resnet.get_workload(
+ num_layers=n_layer,
+ batch_size=batch_size,
+ layout=layout,
+ dtype=dtype,
+ image_shape=image_shape,
+ )
+ elif name.startswith("resnet3d-"):
+ n_layer = int(name.split("-")[1])
+ mod, params = relay.testing.resnet.get_workload(
+ num_layers=n_layer,
+ batch_size=batch_size,
+ layout=layout,
+ dtype=dtype,
+ image_shape=image_shape,
+ )
+ elif name == "mobilenet":
+ mod, params = relay.testing.mobilenet.get_workload(
+            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
+        )
+ elif name == "squeezenet_v1.1":
+ assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
+ mod, params = relay.testing.squeezenet.get_workload(
+ version="1.1",
+ batch_size=batch_size,
+ dtype=dtype,
+ image_shape=image_shape,
+ )
+ elif name == "inception_v3":
+        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
+        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
+ elif name == "mxnet":
+ # an example for mxnet model
+ from mxnet.gluon.model_zoo.vision import get_model
+
+ assert layout == "NCHW"
+
+ block = get_model("resnet50_v1", pretrained=True)
+        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
+ net = mod["main"]
+ net = relay.Function(
+            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
+ )
+ mod = tvm.IRModule.from_expr(net)
+
+ return mod, params, input_shape, output_shape
+
+
+#################################################################
+# Start RPC Tracker
+# -----------------
+# TVM uses an RPC session to communicate with ARM boards.
+# During tuning, the tuner will send the generated code to the board and
+# measure the speed of the code on the board.
+#
+# To scale up the tuning, TVM uses the RPC Tracker to manage distributed devices.
+# The RPC Tracker is a centralized controller node. We can register all devices to
+# the tracker. For example, if we have 10 phones, we can register all of them
+# to the tracker and run 10 measurements in parallel, accelerating the tuning process.
+#
+# To start an RPC tracker, run this command on the host machine. The tracker is
+# required during the whole tuning process, so we need to open a new terminal for
+# this command:
+#
+# .. code-block:: bash
+#
+# python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
+#
+# The expected output is
+#
+# .. code-block:: bash
+#
+# INFO:RPCTracker:bind to 0.0.0.0:9190
+
+#################################################################
+# Register Devices to RPC Tracker
+# -----------------------------------
+# Now we can register our devices to the tracker. The first step is to
+# build the TVM runtime for the ARM devices.
+#
+# * For Linux:
+# Follow this section :ref:`build-tvm-runtime-on-device` to build
+# the TVM runtime on the device. Then register the device to tracker by
+#
+# .. code-block:: bash
+#
+# python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rasp4b-64
+#
+# (replace :code:`[HOST_IP]` with the IP address of your host machine)
+#
+# * For Android:
+#   Follow this `readme page <https://github.com/apache/tvm/tree/main/apps/android_rpc>`_ to
+#   install the TVM RPC APK on the Android device. Make sure you can pass the Android RPC test.
+#   Then you have already registered your device. During tuning, you have to go to developer options
+#   and enable "Keep screen awake during charging", and charge your phone to keep it stable.
+#
+# After registering devices, we can confirm them by querying the RPC tracker:
+#
+# .. code-block:: bash
+#
+# python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
+#
+# For example, if we have 2 Huawei Mate 10 Pro phones, 11 Raspberry Pi 4B boards with a 64-bit OS,
+# and 2 RK3399 boards, the output can be
+#
+# .. code-block:: bash
+#
+# Queue Status
+# ----------------------------------
+# key total free pending
+# ----------------------------------
+# mate10pro 2 2 0
+# rk3399 2 2 0
+# rasp4b-64 11 11 0
+# ----------------------------------
+#
+# You can register multiple devices to the tracker to accelerate the measurement in tuning.
+
+###########################################
+# Set Tuning Options
+# ------------------
+# Before tuning, we should apply some configurations. Here we use a Raspberry Pi 4B 4GB board
+# with a 64-bit OS (Ubuntu 20.04) as an example. In your setting, you should modify the target
+# and device_key accordingly.
+# Set :code:`use_ndk` to True if you use an Android phone.
+
+#### DEVICE CONFIG ####
+
+# Replace "aarch64-linux-gnu" with the correct target of your board.
+# This target is used for cross compilation. You can query it by running :code:`gcc -v` on your device.
+# FIXME(tmoreau89, merrymercy): We leave '-device=arm_cpu' out of the target string
+# because we're sharing the x86 op strategy.
+target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+neon")
+
+# Also replace this with the device key in your tracker
+device_key = "rasp4b-64"
+
+# Set this to True if you use ndk tools for cross compiling,
+# and also set the environment variable below to point to the cross compiler
+use_ndk = False
+# os.environ["TVM_NDK_CC"] = "/usr/bin/aarch64-linux-gnu-g++"
+
+#### TUNING OPTION ####
+network = "mobilenet"
+batch_size = 1
+layout = "NHWC"
+dtype = "float32"
+log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)
+
+#################################################################
+# Extract Search Tasks
+# --------------------
+# Next, we extract the search tasks and their weights from a network.
+# The weight of a task is the number of appearances of the task's subgraph
+# in the whole network.
+# By using the weight, we can approximate the end-to-end latency of the network
+# as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the
+# latency of a task and :code:`weight[t]` is the weight of the task.
+# The task scheduler will just optimize this objective.
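As a toy illustration of this objective (the numbers below are made up, not measured by the tutorial), the weighted-latency estimate can be computed as:

```python
# Hypothetical per-task latencies (ms) and subgraph weights; illustrative only.
latency = [0.013, 0.845, 4.194]  # latency[t]: measured latency of task t
weight = [1, 2, 4]               # weight[t]: occurrences of task t's subgraph

# Approximate end-to-end latency: sum(latency[t] * weight[t])
estimated_total = sum(l * w for l, w in zip(latency, weight))
print(round(estimated_total, 3))  # 18.479
```

The task scheduler allocates more trials to tasks whose improvement would shrink this weighted sum the most.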
+
+# Extract tasks from the network
+print("Extract tasks...")
+mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
+tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
+
+for idx, task in enumerate(tasks):
+ print("========== Task %d (workload key: %s) ==========" % (idx,
task.workload_key))
+ print(task.compute_dag)
+
+
+#################################################################
+# Tuning and Evaluation
+# ---------------------
+# Now, we set some options for tuning and launch the search tasks
+#
+# * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
+# You can set it to a small number (e.g., 200) for a fast demonstrative run.
+# In practice, we recommend setting it around :code:`800 * len(tasks)`,
+# which is typically enough for the search to converge.
+# For example, there are 29 tasks in resnet-50, so we can set it as 20000.
+# You can adjust this parameter according to your time budget.
+# * In addition, we use :code:`RecordToFile` to dump measurement records into a log file.
+#   The measurement records can be used to query the history best, resume the search,
+#   and do more analyses later.
+# * See :any:`auto_scheduler.TuningOptions` and
+#   :any:`auto_scheduler.RPCRunner` for more parameters.
+#
+# After auto-tuning, we can compile the network with the best schedules we found.
+# All measurement records are dumped into the log file during auto-tuning,
+# so we can read the log file and load the best schedules.
+
+
+def tune_and_evaluate():
+ print("Begin tuning...")
+ tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
+ tune_option = auto_scheduler.TuningOptions(
+        num_measure_trials=200,  # change this to 20000 to achieve the best performance
+ runner=auto_scheduler.RPCRunner(
+ device_key,
+ host="0.0.0.0",
+ port=9191,
+ timeout=30,
+ repeat=1,
+ min_repeat_ms=200,
+ enable_cpu_cache_flush=True,
+ ),
+ measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
+ )
+
+ tuner.tune(tune_option)
+
+ # Compile with the history best
+ print("Compile...")
+ with auto_scheduler.ApplyHistoryBest(log_file):
+ with tvm.transform.PassContext(
+ opt_level=3, config={"relay.backend.use_auto_scheduler": True}
+ ):
+ lib = relay.build(mod, target=target, params=params)
+
+ # Export library
+ tmp = tempdir()
+ if use_ndk:
+ from tvm.contrib import ndk
+
+ filename = "net.so"
+ lib.export_library(tmp.relpath(filename), ndk.create_shared)
+ else:
+ filename = "net.tar"
+ lib.export_library(tmp.relpath(filename))
+
+ # Upload module to device
+ print("Upload...")
+    remote = auto_scheduler.utils.request_remote(device_key, "0.0.0.0", 9191, timeout=10000)
+ remote.upload(tmp.relpath(filename))
+ rlib = remote.load_module(filename)
+
+ # Create graph runtime
+ ctx = remote.cpu()
+ module = graph_runtime.GraphModule(rlib["default"](ctx))
+    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
+ module.set_input("data", data_tvm)
+
+ # Evaluate
+ print("Evaluate inference time cost...")
+    ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
+ prof_res = np.array(ftimer().results) * 1e3 # convert to millisecond
+ print(
+ "Mean inference time (std dev): %.2f ms (%.2f ms)" %
(np.mean(prof_res), np.std(prof_res))
+ )
+
+
+# We do not run the tuning on our webpage server since the server doesn't have a Raspberry Pi
+# or a device tracker running.
+# Uncomment the following line to run it by yourself.
+# Uncomment the following line to run it by yourself.
+
+# tune_and_evaluate()
+
+
+######################################################################
+# .. note:: Explaining the printed information during tuning
+#
+#   During the tuning, a lot of information will be printed on the console.
+#   It is used for debugging purposes. The most important info is the output
+#   of the task scheduler. The following table is a sample output.
+#
+# .. code-block:: c
+#
+# ----------------------------------------------------------------------
+# ------------------------------ [ Task Scheduler ]
+# ----------------------------------------------------------------------
+# | ID | Latency (ms) | Speed (GFLOPS) | Trials |
+# -------------------------------------------------
+# | 0 | 0.013 | 0.31 | 64 |
+# | 1 | 0.845 | 2.43 | 448 |
+# | 2 | 0.046 | -0.00 | 64 |
+# | 3 | 4.194 | 24.53 | 2112 |
+# | 4 | 0.109 | 9.21 | 64 |
+# | 5 | 1.759 | 29.27 | 896 |
+# | 6 | 0.083 | 6.01 | 64 |
+# | 7 | 3.084 | 33.38 | 7680 |
+# | 8 | 0.136 | 14.78 | 384 |
+# | 9 | 1.349 | 38.23 | 768 |
+# | 10 | 0.133 | 7.55 | 128 |
+# | 11 | 2.747 | 37.56 | 1536 |
+# | 12 | 0.338 | 11.87 | 192 |
+# | 13 | 1.295 | 40.00 | 704 |
+# | 14 | 0.482 | 4.16 | 256 |
+# | 15 | 2.686 | 38.56 | 1344 |
+# | 16 | 0.884 | 9.08 | 448 |
+# | 17 | 1.332 | 39.18 | 704 |
+# | 18 | 1.045 | 3.84 | 576 |
+# | 19 | 1.391 | 38.09 | 704 |
+# | 20 | 0.777 | 10.34 | 448 |
+# | 21 | 0.739 | 30.97 | 448 |
+# -------------------------------------------------
+#  Estimated total latency: 38.347 ms  Trials: 19992  Used time : 19260 s  Next ID: 3
+#
+# This table lists the latency and (estimated) speed of all tasks.
+# It also lists the allocation of measurement trials for all tasks.
+#   The last line prints the total weighted latency of these tasks,
+#   which can be a rough estimation of the end-to-end execution time
+#   of the network, along with the total number of measurement trials,
+#   the total time spent on auto-tuning, and the ID of the next task to tune.
+#
+#   There will also be some "dmlc::Error" messages, because the
+#   auto-scheduler will try some invalid schedules.
+# You can safely ignore them if the tuning can continue, because these
+# errors are isolated from the main process.
+#
+
+######################################################################
+# .. note:: Terminate the tuning early
+#
+#   You can terminate the tuning early by forcibly killing this process.
+#   As long as you get at least one valid schedule for each task in the log file,
+#   you should be able to do the compilation (the section below).
+#
+
+#################################################################
+# Other Tips
+# ----------
+# 1. During the tuning, the auto-scheduler needs to compile many programs and
+#    extract features from them. This part is CPU-intensive,
+#    so a high-performance CPU with many cores is recommended for faster search.
+# 2. You can use :code:`python3 -m tvm.auto_scheduler.measure_record --mode distill --i log.json`
+#    to distill the large log file and only save the best useful records.
+# 3. You can resume a search from the previous log file. You just need to
+#    add a new argument :code:`load_log_file` when creating the task scheduler
+#    in function :code:`tune_and_evaluate`. Say,
+#    :code:`tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)`
+# 4. If you have multiple target CPUs, you can use all of them for measurements to
+#    parallelize the measurements. Check this :ref:`section <tutorials-autotvm-rpc-tracker>`
+#    to learn how to use the RPC Tracker and RPC Server.
+#    To use the RPC Tracker in auto-scheduler, replace the runner in :code:`TuningOptions`
+#    with :any:`auto_scheduler.RPCRunner`.