[GitHub] [tvm-rfcs] areusch commented on a change in pull request #48: [RFC][BYOC] Marvell ML/AI Accelerator Integration

GitBox Thu, 24 Feb 2022 14:25:43 -0800


areusch commented on a change in pull request #48:
URL: https://github.com/apache/tvm-rfcs/pull/48#discussion_r814216238




##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,560 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: 
[apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: 
[BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to 
bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM 
Neoverse N2-based OCTEON 10 processor.  We are building an easy-to-use, open, 
software suite for our customers by integrating and utilizing TVM so that we 
can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+We follow the standard TVM BYOC flow (as other BYOC integrations do) to create our 
TVM-BYOC-Marvell POC code files and flow under the following folders; refer 
to the uploaded apache/tvm-PR-9730 POC for details:
+
+```
+  - cmake/modules/contrib/Mrvl.cmake
+  - python/tvm/relay/op/contrib/mrvl.py
+  - src/relay/backend/contrib/mrvl/codegen.cc, drop_noop_transpose.cc,
+    graph_executor_codegen_mrvl.cc
+  - src/runtime/contrib/mrvl/mrvl_runtime.cc
+  - tests/python/contrib/test_mrvl/__init__.py, infrastructure.py,
+    test_mrvl_codegen.py
+  - plus, other corresponding changes
+```
+
+Based on what the Marvell ML/AI inference accelerator does best, a given 
pre-trained network model
+is run through a TVM-BYOC-Marvell AOT compilation and code-gen flow, as 
illustrated in Figure 1 and
+STEPs (1), (2), (3a), (3b), and (4) below.
+
+### Figure 1: TVM-BYOC-Marvell AOT Compilation, Code-gen Flow
+![](./assets/0048/figure1-flow.png)
+
+### STEP (1) Run TVM-BYOC-Marvell AOT ML Frontend Compilation and 
TVM-BYOC-Marvell code-gen using typical TVM flow.
+
+The main input to STEP (1) is a pre-trained ONNX or MXNet model. Its output 
is a pair of files, a Nodes-JSON file and a Constants-JSON file, for each 
Marvell sub-graph. This pair of JSON files represents the meta-data of a 
Marvell sub-graph, which is the part of the given pre-trained model 
identified by the TVM-BYOC-Marvell flow.
+
+Using the uploaded POC changes in apache/tvm-PR-9730, a sample code snippet 
for STEP (1) is shown below:
+
+```
+  import tvm
+  from tvm import relay
+  from tvm.relay.op.contrib import mrvl
+  from gluoncv import model_zoo, data, utils
+
+  ...
+
+  ssd_resnet50 = model_zoo.get_model(
+      "ssd_512_resnet50_v1_voc", pretrained=True)
+  inp_shape = (1, 3, 512, 512)
+  raw_model_ir, weight_bias_params = relay.frontend.from_mxnet(
+      ssd_resnet50, {"data": inp_shape})
+
+  # call mrvl.partition_for_mrvl()
+  (model_mrvl, model_other, orig_params, opt_level, disabled_pass, orig_mod,
+      mrvl_layers_in_mrvl_subgraph) = mrvl.partition_for_mrvl(
+      raw_model_ir, params=weight_bias_params, tvm_custom_dict={},
+      gen_non_mrvl_subgraph=False, flow_pass=1)
+
+  # call relay.build() and mrvl.dump_json_meta_data_files()
+  build_target, device_id = "llvm", 0
+  mod_name = relay.backend.utils.mangle_module_name("")
+  byoc_executor = relay.build(
+      model_mrvl, target=build_target, mod_name=mod_name)
+  byoc_const_params = byoc_executor.get_params()
+  byoc_external_graph_json = byoc_executor.get_external_graph_json()
+  nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+      byoc_external_graph_json, byoc_const_params,
+      filename_prefix=f"{model_name}-tvm-mrvl-byoc-ir")
+...
+```
+
+First, we can download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv 
site; then, call the mrvl.partition\_for\_mrvl() function to trigger the 
TVM-BYOC-Marvell flow; and finally, call relay.build() function and 
mrvl.dump\_json\_meta\_data\_files() function to generate a pair of JSON files 
for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.
+
+We are calling the byoc\_executor.get\_external\_graph\_json() function and 
the byoc\_executor.get\_params() function in order to generate both Nodes-JSON 
file and Constants-JSON file, respectively.
+
+* The get\_external\_graph\_json() function is a new addition to Python class 
BuildModule(object).
+* The get\_params() function exists for Python class BuildModule(object), but 
to make it work, we need to disable the "removal external params" CPP code 
block in the CPP class RelayBuildModule.
+
+Sub steps involved in STEP (1) are (refer to Figures 1, 2a, 2b, 3 with 
descriptions below):
+
+* Load pre-trained network into TVM IR graph.
+* Do Marvell-specific layout conversions to transform IR graph in order to 
meet requirements of the accelerator.
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order 
to utilize available HW capability in the accelerator.
+* Do additional Marvell-specific transform pass(es) to further optimize IR 
graph.
+* Partition IR graph into one or more for-accelerator Marvell sub-graphs 
and/or one or more LLVM-non-Marvell sub-graphs (e.g., for running inference on 
ARMv9):
+
+    * These sub-graphs cover the whole pre-trained network.
+
+    * A for-accelerator Marvell sub-graph here means a set of 
connected, composite-merged/fused Call nodes (i.e., not just one 
composite-merged/fused Call node function).  NOTE: the term sub-graph as defined 
here can differ from the existing TVM sub-graph definition.
+
+    * As shown in Figure 2a, a pre-trained CNN ONNX model (on the left) is 
processed by the TVM-BYOC-Marvell flow into only one Marvell sub-graph 
(illustrated in the middle of Figure 2a), where operators of the given ONNX model 
are composite-merged/fused into 8 fused composition functions in the Marvell 
sub-graph. For example, near the bottom left, a set of MatMul + Add + Relu operators 
of the ONNX model is fused into one tvmgen\_mrvl\_main\_7 composition function 
in the Marvell sub-graph.
+
+    * As another example in Figure 2b, given the same CNN ONNX model, we can 
pass a different argument value to mrvl.partition\_for\_mrvl(...) so that the 
TVM-BYOC-Marvell flow identifies one Marvell sub-graph of 4 fused composition 
Call node functions and one LLVM-non-Marvell sub-graph, illustrated as 
sub-graph A (middle top) and sub-graph B (middle bottom), 
respectively.  This argument value can lead to different inference 
performance in terms of meeting latency, bandwidth, and/or memory requirements.
+
+    * For the first TVM-BYOC-Marvell revision, at most one for-accelerator 
Marvell sub-graph and at most one LLVM-non-Marvell sub-graph can be identified; 
plus, the for-accelerator Marvell sub-graph can only use input tensor(s) of 
given pre-trained network as its sub-graph’s input tensors.
+
+    * Figure 3 illustrates what a complex Marvell sub-graph can look like. The 
whole sub-graph shown here is a Marvell sub-graph of more than 100 
fused composition Call node functions, and it comes from the pre-trained 
SSD-ResNet50 MXNet model. The LLVM-non-Marvell sub-graph part of the 
SSD-ResNet50 model is not displayed here, but it contains the rest of the 
object-detection part of the model, which finalizes 2D-BBOXes and labels.
+
+* Do code-gen step for each Marvell sub-graph by producing a pair of Nodes-JSON 
and Constants-JSON files:
+
+    * The TVM-BYOC-Marvell flow also specifies Marvell attributes for each 
composite-merged/fused Call node function so that the generated Nodes-JSON file(s) 
and Constants-JSON file(s) can represent the meta-data information of the Marvell 
sub-graph(s) for post-processing.
+
+    * RFC reviewer feedback: can we identify the Marvell sub-graph by running 
a TIR-only pass after scheduling (with the potential benefit of also operating on 
the logical TIR buffers)? Marvell developers can and will spend time on 
understanding the TIR flow and its passes to find out.
+
+![](./assets/0048/figure2a-onnx-1-mrvl-sub-graph-backend-layers.png)
+
+![](./assets/0048/figure2b-onnx-mrvl-sub-graph-A-llvm-sub-graph-B.png)
+
+![](./assets/0048/figure3-sample-mrvl-sub-graph-for-ssd-resnet50.png)
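
To make the composite-merging/fusing sub-step of STEP (1) concrete, here is a toy Python sketch. It is not the POC implementation: it greedily groups a linear chain of operator names into numbered composite functions, in the way MatMul + Add + Relu becomes one tvmgen\_mrvl\_main\_N function in Figure 2a. The pattern list and the name `fuse_linear_chain` are hypothetical, and the real flow operates on the TVM IR graph rather than on lists of strings.

```python
from typing import List, Sequence, Tuple

# Hypothetical fusing patterns, for illustration only.
PATTERNS: List[Tuple[str, ...]] = [
    ("MatMul", "Add", "Relu"),
    ("MatMul", "Add"),
    ("Conv", "Relu"),
]


def fuse_linear_chain(
    ops: Sequence[str],
    patterns: Sequence[Tuple[str, ...]] = PATTERNS,
    prefix: str = "tvmgen_mrvl_main_",
) -> List[Tuple[str, List[str]]]:
    """Greedily group a linear op sequence into composite functions.

    Returns (function_name, fused_ops) pairs. An op that starts no known
    pattern becomes a single-op composite.
    """
    ordered = sorted(patterns, key=len, reverse=True)  # try longest match first
    fused: List[Tuple[str, List[str]]] = []
    i = 0
    while i < len(ops):
        match = next(
            (p for p in ordered if tuple(ops[i:i + len(p)]) == p),
            (ops[i],),  # fallback: the op on its own
        )
        fused.append((f"{prefix}{len(fused)}", list(match)))
        i += len(match)
    return fused


if __name__ == "__main__":
    chain = ["Conv", "Relu", "MaxPool", "MatMul", "Add", "Relu", "MatMul", "Add"]
    for name, group in fuse_linear_chain(chain):
        print(name, "<-", " + ".join(group))
```

Trying the longest pattern first keeps MatMul + Add + Relu together instead of splitting it into MatMul + Add followed by a lone Relu.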
+
+
+### STEP (2) Run Marvell-ML/AI Backend Compiler to generate model binary for 
each Marvell sub-graph
+
+* As shown in middle left section of Figure 1, labeled as (2), we will 
execute, outside of the typical TVM flow, the Marvell-ML/AI backend compiler 
program to post-process Nodes-JSON and Constants-JSON files of each Marvell 
sub-graph in order to generate final ISA instructions (in a Marvell model 
binary file) to run inference on Marvell accelerator.
+
+* The Marvell-ML/AI backend compiler program will be distributed as: 
mrvl-tvmircomp. For example, the command line below can be used to generate the 
model binary file for a pair of CNN JSON files to run fp16-based inference by 
utilizing 1M bytes of On-Chip memory on each of 4 HW compute tiles:
+
+```
+  $ mrvl-tvmircomp --model_name cnn --nodes cnn-tvm-mrvl-byoc-ir.json \
+        --consts cnn-tvm-mrvl-byoc-const.json \
+        --arch=MLIP --dram_addr_relocatable=1 --ocm_base=0x0 
--ocm_size=0x100000 \
+        --num_tiles=4 --quantize=float16
+
+  note: the output model binary file generated is: cnn.bin
+
+```
+
+* The Marvell backend compiler does additional AOT optimizations, including 
grouping, allocating, and mapping layer-based tensors and computes onto pre-allocated 
resources (such as the 4 compute tiles and 1M bytes on each of the 4 tiles above) 
available on the Marvell accelerator.  Sample layer-based structures used by 
ISA instructions for the CNN model are illustrated in the rightmost column of 
both Figure 2a and Figure 2b.
+
+* Note: Marvell ML/AI accelerator can run inference in either float16 mode or 
int8 quantization mode. For this RFC, we will focus only on float16 AOT 
compilation to run float16 inference.
+
+* Note: Marvell can provide a mrvl-tvmircomp executable to TVM CI environment 
to run TVM Jenkins build & tests.
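
A script that drives STEP (2) might assemble the command line above as an argument list before handing it to a process launcher. The sketch below only builds and prints the command; it does not run the compiler. The helper name `build_mrvl_tvmircomp_cmd` and its default values are assumptions for illustration, and only flags that appear in the example above are used.

```python
import shlex
from typing import List


def build_mrvl_tvmircomp_cmd(
    model_name: str,
    nodes_json: str,
    consts_json: str,
    num_tiles: int = 4,
    ocm_size: int = 0x100000,
    quantize: str = "float16",
) -> List[str]:
    """Build the mrvl-tvmircomp argument list for one Marvell sub-graph.

    Double-dash spelling is assumed for every flag, matching the majority
    of the flags in the example command.
    """
    return [
        "mrvl-tvmircomp",
        "--model_name", model_name,
        "--nodes", nodes_json,
        "--consts", consts_json,
        "--arch=MLIP",
        "--dram_addr_relocatable=1",
        "--ocm_base=0x0",
        f"--ocm_size={ocm_size:#x}",
        f"--num_tiles={num_tiles}",
        f"--quantize={quantize}",
    ]


if __name__ == "__main__":
    cmd = build_mrvl_tvmircomp_cmd(
        "cnn", "cnn-tvm-mrvl-byoc-ir.json", "cnn-tvm-mrvl-byoc-const.json"
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if it is later passed to `subprocess.run`.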
+
+
+### STEP (3a) or (3b) Run inference on the Software Simulator or on the 
Marvell ML/AI HW accelerator for the Marvell sub-graph
+
+* As illustrated in the middle left section of Figure 1, labeled as (3a), a 
Marvell Software Simulator, mlModel, which cycle-approximately mimics the 
Marvell ML/AI HW accelerator, will be distributed. The 
Marvell Software Simulator can be used to read in a Marvell model binary file 
and its corresponding inference input file(s) to run inference and generate 
results for the Marvell sub-graph. For example, the command line below can be 
used to run inference:
+
+```
+  $ mlModel --model_binary cnn.bin --inputs cnn_input/input1.bin --arch=MLIP 
--perf_debug
+
+  note1: the inference output will be saved at: cnn-output.bin
+  note2: optionally, cycle-level information for performance debugging can also 
be dumped
+
+```
+
+* Note: Marvell can provide a mlModel executable to TVM CI environment to run 
TVM Jenkins build & tests.
+
+* Also, as illustrated on the right side of Figure 1, labeled as (3b), tools, 
driver, and firmware are available to run inference 
on a Marvell ML/AI inference HW accelerator.
+
+
+### STEP (4) Use TVM-LLVM Compiler & Runtime to run inference for the 
LLVM-non-Marvell sub-graph
+
+* As illustrated in the bottom left section of Figure 1, labeled as (4), an 
integration step between sub-graphs needs to be done at inference runtime in 
order to run full inference for the given pre-trained model. We can use the 
TVM-LLVM flow to generate a runtime .so binary for each LLVM-non-Marvell 
sub-graph.  POC code for STEP (4) is not yet ready (WIP) and is not included in 
the uploaded apache/tvm-PR-9730.
+
+* For the first BYOC-Marvell revision, at most one integration step from a 
for-accelerator Marvell sub-graph to a LLVM-non-Marvell sub-graph is 
implemented.
+
+### Exercise TVM-BYOC-Marvell flow
+
+To exercise the TVM-BYOC-Marvell flow, we have provided a 
tests/python/contrib/test\_mrvl folder with test\_mrvl\_codegen.py and 
infrastructure.py files that show how to exercise the TVM-BYOC-Marvell 
flow for a pre-trained SSD-ResNet50 model.  In addition, Marvell is 
planning to provide the Marvell backend compiler (mrvl-tvmircomp) and the 
Marvell HW accelerator software simulator (mlModel) so that they can be used to 
read in JSON files generated by the TVM-BYOC-Marvell flow and run inference to 
get results.
+
+In the uploaded apache/tvm-PR-9730 branch,

Review comment:
       could you finish this sentence or rm?

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,560 @@
+First, we can download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv 
site; then, call the mrvl.partition\_for\_mrvl() function to trigger the 
TVM-BYOC-Marvell flow; and finally, call relay.build() function and 
mrvl.dump\_json\_meta\_data\_files() function to generate a pair of JSON files 
for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.

Review comment:
       suggest to use numbered list:
   ```suggestion
   The above code snippet does the following:
   1. Download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv site
   2. Call the `mrvl.partition_for_mrvl()` function to partition the graph into 
Marvell and non-Marvell pieces and trigger the TVM-BYOC-Marvell flow
   3. Call relay.build() function and mrvl.dump\_json\_meta\_data\_files() 
function to generate a pair of JSON files for each Marvell sub-graph identified 
by the TVM-BYOC-Marvell flow.
   ```

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,547 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: 
[apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: 
[BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to 
bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM 
Neoverse N2-based OCTEON 10 processor.
+  We are building an easy-to-use, open, software suite for our customers by 
integrating and utilizing TVM so that
+  we can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Based on what the Marvell ML/AI inference accelerator does best, a given 
pre-trained network model
+is run through a TVM-Mrvl-BYOC AOT compilation and code-gen flow, as 
illustrated in the steps below.
+
+STEP (1) Run TVM-Mrvl-BYOC AOT ML Frontend Compilation and Mrvl-BYOC code-gen. 
The steps involved in this are:
+
+* Load pre-trained network into TVM IR graph
+
+* Do Marvell-specific layout conversions to transform IR graph in order to 
meet requirements of the accelerator
+
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order 
to utilize available HW capability
+  in the accelerator
+
+* Do additional Marvell-specific transform pass(es) to further optimize IR 
graph
+
+* Partition IR graph into one or more for-accelerator Mrvl subgraphs and/or 
one or more for-TVM-target non-Mrvl
+  (e.g., ARMv9) subgraphs
+    * These subgraphs cover the whole pre-trained network
+    * For-accelerator Mrvl subgraph here means & contains connected, 
composite-fused Call nodes (let's call this sub-graph A)
+      as in the given IR graph. A composite-merged Call node can be, for 
instance, fused from this sequence of IR call nodes:
+      conv2d + add + batch_norm + tuple.getitem(0) + relu
+    * For the first Marvell-BYOC revision, at most one for-accelerator Mrvl 
subgraph and at most one for-TVM-target
+      non-Mrvl subgraph (let's call this sub-graph B) can be identified; plus, 
the for-accelerator Mrvl subgraph can
+      only use input tensor(s) of given pre-trained network as its subgraph’s 
input tensors
+
+* Do code-gen step for each for-accelerator Mrvl subgraph:
+    * Marvell-BYOC-specific attributes are introduced for each 
composite-merged/fused Call node so that a Nodes-JSON
+      file and a Constants-JSON file are produced for the Mrvl subgraph
+
+STEP (2) Run Mrvl-ML/AI Backend Compiler to generate model binary for each 
Mrvl subgraph
+
+* The Mrvl-ML/AI backend compiler will be distributed as an executable in the 
OCTEON SDK; and it can be used to read
+  in Nodes-JSON and Constants-JSON files of each Mrvl subgraph as input 
meta-data in order to generate final instructions,
+  in model binary file
+
+* Note: Mrvl-ML/AI backend compiler, which does accelerator-specific 
optimization and code generation, is not
+  upstreamed
+
+STEP (3a) or (3b) Run inference on the software Simulator or on the Mrvl ML/AI 
HW accelerator for the Mrvl subgraph
+
+* The Mrvl Software Simulator of the Mrvl ML/AI HW accelerator will be 
distributed as an executable in a Mrvl-ML/AI tar
+  ball; and it can be used to read in input file(s) and the model binary to 
run inference for the Mrvl subgraph
+
+* Note: Mrvl ML/AI accelerator can run inference in either float16 mode or 
int8 quantization mode. For this RFC, we will
+  focus only on float16 inference run
+
+STEP (4) Use TVM-llvm Compiler & Runtime to run inference
+
+* Perform integration steps between sub-graph(s) in order to run inference for 
the given pre-trained network -
+  note: runtime binary for each for-TVM-target non-Mrvl subgraph can be 
generated, for instance, using the regular TVM
+  LLVM build
+
+* For the first Marvell-BYOC revision, at most one integration step from a 
for-accelerator Mrvl subgraph to
+  a TVM-target non-Mrvl subgraph is implemented
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo 
code for illustration).
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: 
{fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train_images 
dataset and save
+  the pre-trained model in ONNX (say, mnist_fashion.onnx). Then, we can run 
BYOC Marvell flow by giving any
+  image of the orig_test_images[i] dataset to get its inference fashion label 
and item name in top_label_id and
+  fashion_label_dictionary[top_label_id], respectively. In addition, we can 
also use the corresponding
+  golden label, golden_output_labels[i], to validate the inference result.
+
+```
+(train_images, train_labels), (
+    orig_test_images,
+    golden_output_labels,
+) = keras.datasets.fashion_mnist.load_data()
+```
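
The validation described above (take the arg-max of the output tensor, then compare it with the golden label) can be sketched in plain Python. The scores in the usage example are made up, and the helper names `top_label` and `check_prediction` are hypothetical; they are not part of the POC.

```python
from typing import Sequence, Tuple

# Same label map as in the pseudo code above.
FASHION_LABELS = {
    0: "T-shirt/top",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle boot",
}


def top_label(output_tensor: Sequence[float]) -> int:
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(output_tensor)), key=output_tensor.__getitem__)


def check_prediction(
    output_tensor: Sequence[float], golden_label: int
) -> Tuple[int, str, bool]:
    """Return (label id, item name, whether it matches the golden label)."""
    label_id = top_label(output_tensor)
    return label_id, FASHION_LABELS[label_id], label_id == golden_label


if __name__ == "__main__":
    # Made-up scores for a [1, 10] output tensor, flattened to 10 values.
    scores = [0.01, 0.02, 0.05, 0.01, 0.03, 0.02, 0.04, 0.10, 0.02, 0.70]
    label_id, name, ok = check_prediction(scores, golden_label=9)
    print(f"Fashion item identified as: {name} (id={label_id}, matches golden: {ok})")
```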
+
+As illustrated in the tests/python/contrib/test_mrvl/test_mrvl_codegen.py and 
infrastructure.py files as well as
+  in pseudo code below, we can call onnx.load() and relay.frontend.from_onnx() 
to generate TVM mod and params. Then,
+  they are used as function arguments to call the aot_build_and_json_codegen() 
API in order to generate Nodes-JSON file
+  (nodes_json_filename) and Constants-JSON file (consts_json_filename).
+
+* Notes: please refer to the python/tvm/relay/op/contrib/mrvl.py file for more 
details.
+
+* In the mrvl.py file: the partition_for_mrvl() function is the main entry 
point for the BYOC Marvell flow.
+
+* We use relay.build(mod_mrvl_subgraph).get_params() and 
relay.build(mod_mrvl_subgraph).get_external_graph_json()
+    to trigger Marvell-specific GetExternalJSON() and JSON load/save functions 
(as defined in the
+    src/relay/backend/contrib/mrvl/graph_executor_codegen_mrvl.cc file) in 
order to generate
+    Marvell-specific byoc_const_params and byoc_external_graph_json objects.
+
+* In the mrvl.py file: the dump_json_meta_data_files() function takes in 
Marvell-specific byoc_external_graph_json
+    and byoc_const_params objects to generate and return two Marvell-specific 
Nodes-JSON file and Constants-JSON file,
+    respectively.
+
+```
+    # load pre-trained model
+    mnist_fashion_onnx_model = onnx.load("mnist_fashion.onnx")
+    mod, params = relay.frontend.from_onnx(
+        mnist_fashion_onnx_model, dtype="float32", freeze_params=False
+    )
+
+
+    # from test_mrvl_codegen.py: to generate sub graphs and JSON files
+    (
+        nodes_json_filename,
+        consts_json_filename,
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        mrvl_layers_in_mrvl_subgraph,
+        mrvl_layers_in_non_mrvl_subgraph,
+    ) = aot_build_and_json_codegen(
+        "mnist_fashion",  # model_name
+        "mnist",  # working_dir
+        mod,
+        params,
+    )
+
+
+    # from infrastructure.py: pseudo code for the body of the above
+    # aot_build_and_json_codegen() function
+    (
+        mod_mrvl_subgraph,
+        mod_non_mrvl_subgraph,
+        orig_params,
+        opt_level,
+        disabled_pass,
+        orig_mod,
+        mrvl_layers_in_mrvl_subgraph,
+    ) = mrvl.partition_for_mrvl(
+        mod,
+        params=params,
+        tvm_custom_dict={},
+        gen_non_mrvl_subgraph=gen_non_mrvl_subgraph,
+        flow_pass=1,
+    )
+
+    build_target, device_id = "llvm", 0
+    mod_name = relay.backend.utils.mangle_module_name("")
+    byoc_executor = relay.build(
+        mod_mrvl_subgraph, target=build_target, mod_name=mod_name
+    )
+    byoc_const_params = byoc_executor.get_params()
+    byoc_external_graph_json = byoc_executor.get_external_graph_json()
+
+    nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+        byoc_external_graph_json,
+        byoc_const_params,
+        filename_prefix=f"{working_dir}{model_name}-tvm-mrvl-byoc-ir",
+    )
+```
+
+The mod_mrvl_subgraph object and the mod_non_mrvl_subgraph object returned 
from the aot_build_and_json_codegen()
+  call are IR graphs of one for-accelerator Mrvl subgraph and one TVM-target 
non-Mrvl subgraph, respectively.
+
+Different strategies can be used to cut the MNIST model into different sets of 
at most one Mrvl subgraph and at
+  most one non-Mrvl subgraph. Below we illustrate one such strategy 
(i.e., the default strategy), in which,
+  for this specific sample MNIST model, the entire network model is 
turned into one Mrvl subgraph and
+  no non-Mrvl subgraph.
+
+* Below is the original IR graph - i.e., right after from_onnx() call
+
+```
+    #[version = "0.0.5"]
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 
10), float32] {
+      %0 = nn.conv2d(%permute_input, meta[relay.Constant][0] /* ty=Tensor[(64, 
1, 2, 2), float32] */,
+          padding=[0, 0, 1, 1], channels=64, kernel_size=[2, 2], /* en_id=418 
*/) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %1 = nn.bias_add(%0, meta[relay.Constant][1] /* ty=Tensor[(64), float32] 
*/,
+          /* en_id=419 */) /* ty=Tensor[(1, 64, 28, 28), float32] */;
+      %2 = nn.relu(%1, /* en_id=420 */) /* ty=Tensor[(1, 64, 28, 28), float32] 
*/;
+      %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 
0, 0],
+          /* en_id=449 */) /* ty=Tensor[(1, 64, 14, 14), float32] */;
+      %4 = nn.conv2d(%3, meta[relay.Constant][2] /* ty=Tensor[(32, 64, 2, 2), 
float32] */,
+          padding=[0, 0, 1, 1], channels=32, kernel_size=[2, 2], /* en_id=472 
*/) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %5 = nn.bias_add(%4, meta[relay.Constant][3] /* ty=Tensor[(32), float32] 
*/,
+          /* en_id=473 */) /* ty=Tensor[(1, 32, 14, 14), float32] */;
+      %6 = nn.relu(%5, /* en_id=474 */) /* ty=Tensor[(1, 32, 14, 14), float32] 
*/;
+      %7 = nn.max_pool2d(%6, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 
0, 0],
+          /* en_id=515 */) /* ty=Tensor[(1, 32, 7, 7), float32] */;
+      %8 = transpose(%7, axes=[0, 2, 3, 1], /* en_id=516 */) /* ty=Tensor[(1, 
7, 7, 32), float32] */;
+      %9 = nn.batch_flatten(%8, /* en_id=538 */) /* ty=Tensor[(1, 1568), 
float32] */;
+      %10 = transpose(meta[relay.Constant][4] /* ty=Tensor[(1568, 256), 
float32] */, axes=[1, 0],
+          /* en_id=599 */) /* ty=Tensor[(256, 1568), float32] */;
+      %11 = nn.dense(%9, %10, units=None, out_dtype="float32", /* en_id=600 
*/) /* ty=Tensor[(1, 256), float32] */;
+      %12 = add(%11, meta[relay.Constant][5] /* ty=Tensor[(256), float32] */,
+          /* en_id=601 */) /* ty=Tensor[(1, 256), float32] */;
+      %13 = nn.relu(%12, /* en_id=602 */) /* ty=Tensor[(1, 256), float32] */;
+      %14 = transpose(meta[relay.Constant][6] /* ty=Tensor[(256, 10), float32] 
*/, axes=[1, 0],
+          /* en_id=675 */) /* ty=Tensor[(10, 256), float32] */;
+      %15 = nn.dense(%13, %14, units=None, out_dtype="float32", /* en_id=676 
*/) /* ty=Tensor[(1, 10), float32] */;
+      add(%15, meta[relay.Constant][7] /* ty=Tensor[(10), float32] */, /* 
en_id=677 */) /* ty=Tensor[(1, 10), float32] */
+    }
+
+```
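
The tensor shapes annotated in the IR graph above can be re-derived by hand, which is a useful sanity check when reading such dumps. The sketch below assumes the usual sliding-window size formula, out = (in + pad_begin + pad_end - kernel) // stride + 1, and that the 4-element Relay padding is ordered [top, left, bottom, right]. The helper names `window_out` and `mnist_shapes` are hypothetical.

```python
from typing import List, Tuple


def window_out(
    size: int, kernel: int, stride: int = 1, pad_begin: int = 0, pad_end: int = 0
) -> int:
    """Output length of a conv/pool window sliding along one spatial axis."""
    return (size + pad_begin + pad_end - kernel) // stride + 1


def mnist_shapes() -> List[Tuple[str, Tuple[int, ...]]]:
    """Re-derive the NCHW shapes of the shape-changing ops in the IR dump."""
    shapes = []
    h = 28  # input is (1, 1, 28, 28); height and width are treated alike

    h = window_out(h, kernel=2, pad_end=1)      # conv2d, padding=[0, 0, 1, 1]
    shapes.append(("conv2d_0", (1, 64, h, h)))
    h = window_out(h, kernel=2, stride=2)       # max_pool2d, no padding
    shapes.append(("max_pool2d_0", (1, 64, h, h)))
    h = window_out(h, kernel=2, pad_end=1)      # conv2d, padding=[0, 0, 1, 1]
    shapes.append(("conv2d_1", (1, 32, h, h)))
    h = window_out(h, kernel=2, stride=2)       # max_pool2d, no padding
    shapes.append(("max_pool2d_1", (1, 32, h, h)))
    shapes.append(("batch_flatten", (1, 32 * h * h)))
    return shapes


if __name__ == "__main__":
    for name, shape in mnist_shapes():
        print(f"{name:14s} {shape}")
```

The flattened size 32 * 7 * 7 = 1568 matches the first dimension of the (1568, 256) dense weight in the dump.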
+
+* We can get to the following one Mrvl subgraph by applying the default 
strategy.
+    * in the mrvl.py file: the compute_two_subgraphs() function of the class 
MrvlIRGraphUtils is used
+      to create mod_mrvl_subgraph and mod_non_mrvl_subgraph for
+
+```
+    def @main(%permute_input: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 
10), float32] {
+      %0 = @tvmgen_mrvl_main_0(%permute_input, /* en_id=4136 */) /* 
ty=Tensor[(1, 28, 28, 1), float32] */;
+      %1 = @tvmgen_mrvl_main_1(%0, /* en_id=4137 */) /* ty=Tensor[(1, 28, 28, 
64), float32] */;
+      %2 = @tvmgen_mrvl_main_2(%1, /* en_id=4138 */) /* ty=Tensor[(1, 14, 14, 
64), float32] */;
+      %3 = @tvmgen_mrvl_main_3(%2, /* en_id=4139 */) /* ty=Tensor[(1, 14, 14, 
32), float32] */;
+      %4 = @tvmgen_mrvl_main_4(%3, /* en_id=4140 */) /* ty=Tensor[(1, 7, 7, 
32), float32] */;
+      %5 = @tvmgen_mrvl_main_5(%4, /* en_id=4141 */) /* ty=Tensor[(1, 1568), 
float32] */;
+      %6 = @tvmgen_mrvl_main_6(%5, /* en_id=4142 */) /* ty=Tensor[(1, 256), 
float32] */;
+      @tvmgen_mrvl_main_7(%6, /* en_id=4143 */) /* ty=Tensor[(1, 10), float32] 
*/
+    }
+```
+
+* The above Mrvl subgraph is formed by "not-yet optimized Marvell 
(backend) layers". For example,
+    tvmgen_mrvl_main_0 to tvmgen_mrvl_main_7 are composited/fused Marvell 
layers.
+    * In the mrvl.mrvl_pattern_table() function, fusing patterns have been 
defined in order to composite
+      original IR nodes into Marvell backend layers.
+    * For example, the following 3 IR call nodes (nn.conv2d + nn.bias_add + 
nn.relu) in the original IR graph
+      are composited into one Marvell layer: tvmgen_mrvl_main_1, conceptually 
speaking.
+```
+      # from original IR graphs

Review comment:
       nevermind, i see you are indeed reusing the device partition flow

##########
File path: rfcs/0048-BYOC-Marvell-ML-accelerator-integration.md
##########
@@ -0,0 +1,560 @@
+- Feature Name: (fill me in with a unique identifier, `my_awesome_feature`)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+- GitHub pre-RFC PR: 
[apache/tvm-PR-9730](https://github.com/apache/tvm/pull/9730)
+- GitHub pre-RFC discussion: 
[BYOC-Marvell](https://discuss.tvm.apache.org/t/pre-rfc-byoc-marvell-ml-ai-accelerator-integration/11691)
+
+# Summary
+[summary]: #summary
+
+Integrate Marvell’s ML/AI accelerator with TVM BYOC framework in order to 
bring the TVM ecosystem to Marvell customers.
+
+# Motivation
+[motivation]: #motivation
+
+Marvell MLIP is an ML/AI inference accelerator and is embedded on our ARM 
Neoverse N2-based OCTEON 10 processor.  We are building an easy-to-use, open, 
software suite for our customers by integrating and utilizing TVM so that we 
can bring TVM capability and experience to our customers.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+We follow what the TVM BYOC flow does (e.g., as done by others) to create our 
TVM-BYOC-Marvell POC code files and flow under the following folders -- refer 
to the uploaded apache/tvm-PR-9730 POC for details:
+
+```
+  - cmake/modules/contrib/Mrvl.cmake
+  - python/tvm/relay/op/contrib/mrvl.py
+  - src/relay/backend/contrib/mrvl/codegen.cc, drop_noop_transpose.cc,
+    graph_executor_codegen_mrvl.cc
+  - src/runtime/contrib/mrvl/mrvl_runtime.cc
+  - tests/python/contrib/test_mrvl/__init__.py, infrastructure.py,
+    test_mrvl_codegen.py
+  - plus, other corresponding changes
+```
+
+Based on what the Marvell ML/AI inference accelerator does best, a given 
pre-trained network model
+will go through a TVM-BYOC-Marvell AOT compilation and code-gen flow as 
illustrated in Figure 1 and
+STEPs (1), (2), (3a), (3b), and (4) below.
+
+### Figure 1: TVM-BYOC-Marvell AOT Compilation, Code-gen Flow
+![](./assets/0048/figure1-flow.png)
+
+### STEP (1) Run TVM-BYOC-Marvell AOT ML Frontend Compilation and 
TVM-BYOC-Marvell code-gen using typical TVM flow.
+
+The main input to STEP (1) is a pre-trained ONNX or MXNet model; and two 
outputs coming out of STEP (1) include a pair of Nodes-JSON file and 
Constants-JSON file for each Marvell sub-graph. This pair of JSON files 
represents the meta-data information of a Marvell sub-graph, which is a part of 
the given pre-trained model identified by the TVM-BYOC-Marvell flow.
+
+Using the uploaded POC changes in apache/tvm-PR-9730, a sample code snippet 
for STEP (1) is shown below:
+
+```
+  import tvm
+  from tvm import relay
+  from tvm.relay.op.contrib import mrvl
+  from gluoncv import model_zoo, data, utils
+
+  ...
+
+  ssd_resnet50 = model_zoo.get_model("ssd_512_resnet50_v1_voc", 
pretrained=True)
+  inp_shape = (1, 3, 512, 512)
+  raw_model_ir, weight_bias_params = relay.frontend.from_mxnet(ssd_resnet50, {"data": 
inp_shape})
+
+  # call mrvl.partition_for_mrvl()
+  (model_mrvl, model_other, orig_params, opt_level, disabled_pass, orig_mod,
+      mrvl_layers_in_mrvl_subgraph) = mrvl.partition_for_mrvl(
+      raw_model_ir, params=weight_bias_params, tvm_custom_dict={},
+      gen_non_mrvl_subgraph=False, flow_pass=1)
+
+  # call relay.build() and mrvl.dump_json_meta_data_files()
+  build_target, device_id = "llvm", 0
+  mod_name = relay.backend.utils.mangle_module_name("")
+  byoc_executor = relay.build(model_mrvl, target=build_target, 
mod_name=mod_name)
+  byoc_const_params = byoc_executor.get_params()
+  byoc_external_graph_json = byoc_executor.get_external_graph_json()
+  nodes_json_filename, consts_json_filename = mrvl.dump_json_meta_data_files(
+      byoc_external_graph_json, byoc_const_params,
+      filename_prefix=f"{model_name}-tvm-mrvl-byoc-ir")
+...
+```
+
+First, we can download a pre-trained SSD-ResNet50 model from the MXNet-gluoncv 
site; then, call the mrvl.partition\_for\_mrvl() function to trigger the 
TVM-BYOC-Marvell flow; and finally, call relay.build() function and 
mrvl.dump\_json\_meta\_data\_files() function to generate a pair of JSON files 
for each Marvell sub-graph identified by the TVM-BYOC-Marvell flow.
+
+We are calling the byoc\_executor.get\_external\_graph\_json() function and 
the byoc\_executor.get\_params() function in order to generate both Nodes-JSON 
file and Constants-JSON file, respectively.
+
+* The get\_external\_graph\_json() function is a new addition to Python class 
BuildModule(object).
+* The get\_params() function exists for Python class BuildModule(object), but 
to make it work, we need to disable the "removal external params" CPP code 
block in the CPP class RelayBuildModule.
+
+Sub steps involved in STEP (1) are (refer to Figures 1, 2a, 2b, 3 with 
descriptions below):
+
+* Load pre-trained network into TVM IR graph.
+* Do Marvell-specific layout conversions to transform IR graph in order to 
meet requirements of the accelerator.
+* Do Marvell-specific composite-merging/fusing to transform IR graph in order 
to utilize available HW capability in the accelerator.
+* Do additional Marvell-specific transform pass(es) to further optimize IR 
graph.
+* Partition IR graph into one or more for-accelerator Marvell sub-graphs 
and/or one or more LLVM-non-Marvell sub-graphs (e.g., for running inference on 
ARMv9):
+
+    * These sub-graphs cover the whole pre-trained network.
+
+    * A for-accelerator Marvell sub-graph here means a set of 
connected, composite-merged/fused Call nodes (i.e., not just one 
composite-merged/fused Call node function).  NOTE: the term sub-graph as defined 
here can differ from the existing TVM sub-graph definition.
+
+    * As shown in Figure 2a, a pre-trained CNN ONNX model (on the left) is 
processed by the TVM-BYOC-Marvell flow into only one Marvell sub-graph 
(illustrated in the middle of Figure 2a) where operators of the given ONNX model 
are composite-merged/fused into 8 fused composition functions in the Marvell 
sub-graph. For example, near bottom left a set of MatMul + Add + Relu operators 
of the ONNX model are fused into one tvmgen\_mrvl\_main\_7 composition function 
in the Marvell sub-graph.
+
+    * As another example in Figure 2b, given the same CNN ONNX model, we can 
pass a different argument value to the TVM-BYOC-Marvell flow, 
mrvl.partition\_for\_mrvl(...), so that it identifies one Marvell sub-graph of 4 
fused composition Call node functions and another LLVM-non-Marvell sub-graph, as 
illustrated in the middle top sub-graph A and in the middle bottom sub-graph B, 
respectively.  Such an argument value can lead to different inference 
performance in terms of meeting latency, bandwidth, and/or memory requirements.
+
+    * For the first TVM-BYOC-Marvell revision, at most one for-accelerator 
Marvell sub-graph and at most one LLVM-non-Marvell sub-graph can be identified; 
plus, the for-accelerator Marvell sub-graph can only use input tensor(s) of 
given pre-trained network as its sub-graph’s input tensors.
+
+    * Figure 3 illustrates what a complex Marvell sub-graph can look like. The 
whole sub-graph shown here represents a Marvell sub-graph of more than 100 
fused composition Call node functions and it comes from the pre-trained 
SSD-ResNet50 MXNet model. The LLVM-non-Marvell sub-graph part of the 
SSD-ResNet50 model is not displayed here but it contains the rest of the 
object-detection part of the model in order to finalize 2D-BBOXes and labels.
+
+* Do code-gen step for each Marvell sub-graph by producing a pair of Nodes-JSON 
and Constants-JSON files:
+
+    * The TVM-BYOC-Marvell flow also specifies Marvell attributes for each 
composite-merged/fused Call node function so that generated Nodes-JSON file(s) 
and Constants-JSON file(s) can represent the meta-data information of Marvell 
sub-graph(s) in order to do post-processing.
+
+    * RFC reviewer feedback: can we identify the Marvell sub-graph by running 
a TIR-only pass after scheduling (with the potential benefit to also operate on 
the logical TIR buffers)? Marvell developers can and will spend time on 
understanding the TIR flow and its passes to find out.
+
+![](./assets/0048/figure2a-onnx-1-mrvl-sub-graph-backend-layers.png)
+
+![](./assets/0048/figure2b-onnx-mrvl-sub-graph-A-llvm-sub-graph-B.png)
+
+![](./assets/0048/figure3-sample-mrvl-sub-graph-for-ssd-resnet50.png)
+
+
+### STEP (2) Run Marvell-ML/AI Backend Compiler to generate model binary for 
each Marvell sub-graph
+
+* As shown in middle left section of Figure 1, labeled as (2), we will 
execute, outside of the typical TVM flow, the Marvell-ML/AI backend compiler 
program to post-process Nodes-JSON and Constants-JSON files of each Marvell 
sub-graph in order to generate final ISA instructions (in a Marvell model 
binary file) to run inference on Marvell accelerator.
+
+* The Marvell-ML/AI backend compiler program will be distributed as: 
mrvl-tvmircomp. For example, the command line below can be used to generate the 
model binary file for a pair of CNN JSON files to run fp16-based inference by 
utilizing 1M bytes of On-Chip memory on each of 4 HW compute tiles:
+
+```
+  $ mrvl-tvmircomp --model_name cnn --nodes cnn-tvm-mrvl-byoc-ir.json \
+        --consts cnn-tvm-mrvl-byoc-const.json \
+        --arch=MLIP --dram_addr_relocatable=1 --ocm_base=0x0 
-ocm_size=0x100000 \
+        --num_tiles=4 --quantize=float16
+
+  note: the output model binary file generated is: cnn.bin
+
+```
+
+* The Marvell backend compiler does additional AOT optimizations, including 
grouping, allocating, and mapping layer-based tensors and computes onto pre-allocated 
resources (such as, above, 4 compute tiles and 1M bytes on each of the 4 tiles) 
available on the Marvell accelerator.  Sample layer-based structures used by 
ISA instructions for the CNN model are illustrated in the right-most column in 
both Figure 2a and Figure 2b.
+
+* Note: Marvell ML/AI accelerator can run inference in either float16 mode or 
int8 quantization mode. For this RFC, we will focus only on float16 AOT 
compilation to run float16 inference.
+
+* Note: Marvell can provide a mrvl-tvmircomp executable to TVM CI environment 
to run TVM Jenkins build & tests.
+
+
+### STEP (3a) or (3b) Run inference on the Software Simulator or on the 
Marvell ML/AI HW accelerator for the Marvell sub-graph
+
+* As illustrated in the middle left section of Figure 1, labeled as (3a), a 
cycle-approximate Marvell Software Simulator, mlModel, which mimics the 
Marvell ML/AI HW accelerator, will be distributed. The 
Marvell Software Simulator can be used to read in a Marvell model binary file 
and its corresponding inference input file(s) to run inference to generate 
results for the Marvell sub-graph. For example, the command line below can be 
used to run inference:
+
+```
+  $ mlModel --model_binary cnn.bin --inputs cnn_input/input1.bin --arch=MLIP 
--perf_debug
+
+  note1: the inference output will be saved at: cnn-output.bin
+  note2: optionally, cycle level information for performance debug can also 
be dumped
+
+```
+
+* Note: Marvell can provide a mlModel executable to TVM CI environment to run 
TVM Jenkins build & tests.
+
+* Also as illustrated on the right side of Figure 1, labeled as (3b), tools, 
driver and firmware are available such that they can be used to run inference 
on a Marvell ML/AI inference HW accelerator.
+
+
+### STEP (4) Use TVM-LLVM Compiler & Runtime to run inference for the 
LLVM-non-Marvell sub-graph
+
+* As illustrated in the bottom left section of Figure 1, labeled as (4), an 
integration step between sub-graph(s) needs to be done at inference runtime in 
order to run full inference for the given pre-trained model. We can use the 
TVM-LLVM flow to generate a runtime .so binary for each LLVM-non-Marvell 
sub-graph.  POC code for STEP (4) is not yet ready (WIP) and is not included in 
the uploaded apache/tvm-PR-9730.
+
+* For the first BYOC-Marvell revision, at most one integration step from a 
for-accelerator Marvell sub-graph to a LLVM-non-Marvell sub-graph is 
implemented.
+
+### Exercise TVM-BYOC-Marvell flow
+
+To exercise the TVM-BYOC-Marvell flow, we have provided a 
tests/python/contrib/test\_mrvl folder with test\_mrvl\_codegen.py and 
infrastructure.py files that show how to exercise the TVM-BYOC-Marvell 
flow for a pre-trained SSD-ResNet50 model.  In addition, Marvell is also 
planning to provide the Marvell backend compiler (mrvl-tvmircomp) and the 
Marvell HW accelerator software simulator (mlModel) so that they can be used to 
read in JSON files generated by the TVM-BYOC-Marvell flow to run inference to 
get results.
+
+In the uploaded apache/tvm-PR-9730 branch,
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+### Illustration using a MNIST model
+
+Let's use a Keras MNIST fashion model below as an example (partial & pseudo 
code for illustration). Please also refer to the files of the uploaded 
apache/tvm-PR-9730 for details.
+
+```
+  Get Input-Fashion-Image-Tensor-nchw - input_shape: [1, 1, 28, 28]
+
+  keras.Input(shape=input_shape)
+  keras.layers.Conv2D(64, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Conv2D(32, kernel_size=(2, 2), activation="relu")
+  keras.layers.MaxPooling2D(pool_size=(2, 2))
+  keras.layers.Dropout(0.3)
+  keras.layers.Reshape()
+  keras.layers.Dense(256, activation="relu")
+  keras.layers.Dense(10)
+
+  Generate Output-Tensor - output_shape: [1, 10]
+
+  top_label_id = numpy.argmax(Output-Tensor)
+  # fashion label map
+  fashion_label_dictionary = {
+      0: "T-shirt/top",
+      1: "Trouser",
+      2: "Pullover",
+      3: "Dress",
+      4: "Coat",
+      5: "Sandal",
+      6: "Shirt",
+      7: "Sneaker",
+      8: "Bag",
+      9: "Ankle boot",
+  }
+  print(f"Fashion item identified as: 
{fashion_label_dictionary[top_label_id]}")
+```
+
+We can train the above MNIST fashion model using the following train\_images 
dataset and save the pre-trained model in ONNX (say, mnist\_fashion.onnx). 
Then, we can run BYOC Marvell flow by giving any image of the 
orig\_test\_images[i] dataset to get its inference fashion label and item name 
in top\_label\_id and fashion\_label\_dictionary[top\_label\_id], respectively. 
In addition, we can also use the corresponding golden label, 
golden\_output\_labels[i], to validate the inference result.
+
+```
+  (train_images, train_labels), (
+      orig_test_images,
+      golden_output_labels,
+  ) = keras.datasets.fashion_mnist.load_data()
+```
+
+In the code snippet below, we call onnx.load() and relay.frontend.from\_onnx() 
to generate TVM mod and params. Then, they are used by the 
mrvl.partition\_for\_mrvl() function and the 
mrvl.dump\_json\_meta\_data\_files() function provided for the TVM-BYOC-Marvell 
flow to generate Nodes-JSON file (nodes\_json\_filename) and Constants-JSON 
file (consts\_json\_filename).

Review comment:
       in the PoC PR, `partition_for_mrvl` is registered in 
python/tvm/driver/tvmc/composite_target.py along with the other BYOC 
partitioners, but its signature differs significantly (from the de-facto 
`partition_func(IRModule) -> IRModule`):
   ```
       """Partition the graph greedily offloading supported
       operators to Mrvl
   
       Parameters
       ----------
       mod : Module
           The module to run passes on.
       params : Optional[Dict[str, NDArray]]
           Constant input parameters.
   
       Returns
       -------
       mod_mrvl : annotated and partitioned module - part 1, the mrvl sub graph
       mod_other : annotated and partitioned module - part 2, if any, the rest 
sub graph
       params : TBA
       opt_level : TBA
       disabled_pass_list : TBA
       mod : TBA
       mrvl_layers_in_mrvl_subgraph : TBA
       """
   ```
   
   what's your intention here?  in order to register this function in 
`REGISTERED_CODEGEN`, you'll need to make that signature match up. however, i 
think from my reading, what's happening here is you're invoking a fair bit of 
the compilation pipeline underneath a hard-coded PassContext, then returning a 
fair bit of extra information here. some of this information looks fairly 
specific to the Marvell lowering flow.
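
For illustration only, here is one way the signature mismatch could be handled: keep the rich, Marvell-specific entry point under its own name and register a thin adapter with the module-in, module-out shape. This is a minimal sketch, not the PoC's code and not TVM's API. `MrvlPartitionResult`, `partition_for_mrvl_full`, `partition_for_mrvl_adapter`, and `REGISTRY` are all hypothetical names, the field names simply mirror the docstring above, and a plain tuple stands in for an IRModule so the snippet runs without TVM.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class MrvlPartitionResult:
    """Hypothetical container for the extra values the docstring above lists."""
    mod_mrvl: Any
    mod_other: Optional[Any] = None
    params: Dict[str, Any] = field(default_factory=dict)
    mrvl_layers_in_mrvl_subgraph: List[str] = field(default_factory=list)


def partition_for_mrvl_full(
    mod: Any, params: Optional[Dict[str, Any]] = None
) -> MrvlPartitionResult:
    """Stand-in for the rich Marvell entry point; no real partitioning happens here."""
    return MrvlPartitionResult(mod_mrvl=("mrvl", mod), params=dict(params or {}))


def partition_for_mrvl_adapter(
    mod: Any, params: Optional[Dict[str, Any]] = None, **opts: Any
) -> Any:
    """Adapter with the module-in, module-out shape; drops the Marvell-only extras."""
    return partition_for_mrvl_full(mod, params).mod_mrvl


# Hypothetical registry, only loosely modeled on the idea of REGISTERED_CODEGEN.
REGISTRY: Dict[str, Callable[..., Any]] = {"mrvl": partition_for_mrvl_adapter}

partitioned = REGISTRY["mrvl"]("my_module", params={"w": 1})
print(partitioned)
```

Callers that need the Marvell-specific extras would call the full function directly, while generic tooling would only ever see the adapter.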




