This is an automated email from the ASF dual-hosted git repository.

echuraev pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm.git


The following commit(s) were added to refs/heads/main by this push:
     new d35a8ab135 [CLML][CODEGEN] CLML native codegen utility (#13837)
d35a8ab135 is described below

commit d35a8ab1353afc40317396b2ddfda8f35a99ba8a
Author: Siva <[email protected]>
AuthorDate: Fri Feb 3 11:35:55 2023 +0530

    [CLML][CODEGEN] CLML native codegen utility (#13837)
    
    * [CLML][CODEGEN] CLML native codegen utility
    
    This util generates native CLML code given a DNN model.
    It does import via tvmc, extracts clml_modules, get the json source and
    finally generates clml_models.cc that holds source for various sub graphs.
    cpp_clml tool has additional infrastructure to compile it as a standalone
    binary that runs these models.
    
    This PR adds the symbol name to the generated json graph.
    Also, extends const_loader interface to get constant params.
    
    * * review comments
    
    * * review
    
    * * review
---
 apps/cpp_clml/CMakeLists.txt                       |  61 ++
 apps/cpp_clml/README.md                            | 145 ++++
 apps/cpp_clml/clml_runner.cc                       | 818 +++++++++++++++++++++
 apps/cpp_clml/clml_runner.h                        | 262 +++++++
 apps/cpp_clml/main.cc                              | 243 ++++++
 apps/cpp_clml/scripts/clml_codegen.py              |  64 ++
 cmake/modules/contrib/CLML.cmake                   |   2 +-
 docker/Dockerfile.ci_adreno                        |   3 +
 python/tvm/relay/op/contrib/clml.py                | 772 +++++++++++++++++++
 .../backend/contrib/codegen_json/codegen_json.h    |   1 +
 src/runtime/const_loader_module.cc                 |  10 +
 src/runtime/contrib/json/json_runtime.h            |   3 +
 12 files changed, 2383 insertions(+), 1 deletion(-)

diff --git a/apps/cpp_clml/CMakeLists.txt b/apps/cpp_clml/CMakeLists.txt
new file mode 100644
index 0000000000..8c0fd53bf9
--- /dev/null
+++ b/apps/cpp_clml/CMakeLists.txt
@@ -0,0 +1,61 @@
+cmake_minimum_required(VERSION 3.13)
+
+project(clml_run VERSION 2.0)
+
+if(NOT DEFINED CMAKE_TOOLCHAIN_FILE)
+  message( FATAL_ERROR "CMAKE_TOOLCHAIN_FILE Not set, forcing exit. Suggested 
value: {ANDROID_NDK_PATH}/build/cmake/android.toolchain.cmake." )
+endif(NOT DEFINED CMAKE_TOOLCHAIN_FILE)
+
+if(NOT DEFINED ANDROID_ABI)
+  message( FATAL_ERROR "ANDROID_ABI Not set, forcing exit. Suggested value(s): 
arm64-v8a (64), armeabi-v7a (32)" )
+endif(NOT DEFINED ANDROID_ABI)
+
+if(NOT DEFINED CLML_SDK)
+  message( FATAL_ERROR "CLML_SDK Not set, forcing exit." )
+endif(NOT DEFINED CLML_SDK)
+
+if (CMAKE_FIND_ROOT_PATH_MODE_LIBRARY STREQUAL "ONLY")
+  set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY BOTH)
+endif()
+
+find_library(CLML_LIBRARIES NAMES libOpenCL.so NO_DEFAULT_PATH PATHS 
${CLML_SDK}/lib ${CLML_SDK}/lib64)
+
+# CMake/Android variables
+set( ANDROID_STL  c++_static CACHE STRING "Target Android STL") # default
+
+# Source variables
+set( OPENCL_INCLUDE_DIRS  ${CLML_SDK} CACHE PATH "filepath to OpenCL headers")
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED True)
+
+#we do not want to pass -fno-exceptions
+if(${CMAKE_CXX_FLAGS} MATCHES "-fno-exceptions")
+  message ( WARNING "Disabling -fno-exceptions")
+  string(REGEX REPLACE "-fno-exceptions" "" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
+endif()
+
+#we do not want to pass -fno-rtti
+if(${CMAKE_CXX_FLAGS} MATCHES "-fno-rtti")
+  message ( WARNING "Disabling -fno-rtti")
+  string(REGEX REPLACE "-fno-rtti" "" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
+endif()
+
+set(COMMON_SOURCE_FILES
+        clml_models.cc
+        clml_runner.cc
+        clml_runner.h
+        main.cc
+        ../../3rdparty/cnpy/cnpy.cpp
+        )
+
+include_directories(
+        src
+        ${OPENCL_INCLUDE_DIRS}
+        "../../3rdparty/dmlc-core/include"
+        "../../3rdparty/cnpy/"
+        )
+
+add_executable(clml_run ${COMMON_SOURCE_FILES})
+target_link_options(clml_run PRIVATE 
-Wl,--unresolved-symbols=ignore-in-shared-libs)
+target_link_libraries(clml_run ${CLML_LIBRARIES} z)
diff --git a/apps/cpp_clml/README.md b/apps/cpp_clml/README.md
new file mode 100644
index 0000000000..3200492122
--- /dev/null
+++ b/apps/cpp_clml/README.md
@@ -0,0 +1,145 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements.  See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership.  The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License.  You may obtain a copy of the License at -->
+
+<!---   http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied.  See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+
+# OpenCLML Debug Tool
+
+Tool to generate OpenCLML source file given a model from any framework and 
compile it as a native application that runs on Android target.
+This tool helps to debug or triage OpenCLML offloaded sub graphs as a 
standalone application.
+
+### Codegen
+
+Models can be downloaded from well-known frameworks like Tensorflow, PyTorch,
TFLite, Onnx, etc.
+Assuming ```resnet50.h5``` is a Keras ResNet50 model file, use the below
command to generate an OpenCLML source for the model.
+
+```bash
+python3 scripts/clml_codegen.py resnet50.h5
+```
+
+Above command generates ```clml_models.cc``` and ```clml_params.npz```.
+```clml_models.cc``` contains cpp representation of all OpenCLML subgraphs 
offloaded by TVM compilation. This file will be used to build tool 
```clml_run```.
+```clml_params.npz``` is a numpy dump of all params involved in all sub graphs 
of TVM module. This file to be copied to target.
+
+### Build Tool
+
+Copy the generated models source ```clml_models.cc``` under ```cpp_clml```.
+
+Below commands will compile the tool ```clml_run``` from generated source and 
other static dependents.
+
+```bash
+cmake -S . -B build_64 -D ANDROID_ABI=arm64-v8a -D CLML_SDK=<CLML SDK PATH> -D 
CMAKE_TOOLCHAIN_FILE=<ANDROID NDK PATH>/build/cmake/android.toolchain.cmake -D 
ANDROID_PLATFORM=latest
+cmake --build build_64
+```
+
+### Run the tool
+
+Copy ```clml_params.npz``` and ```clml_run``` to the target Android device
+
+```bash
+Android:/data/local/tmp $ ./clml_run --dump-meta
+Input         =
+Output        =
+Params        =
+DumpMeta      = 1
+.....
+Subgraph Name: tvmgen_default_clml_main_1
+    Input Count  : 1
+    Output Count : 1
+    Input MetaInfo
+        Input: tvmgen_default_clml_main_1_input_0
+            Dtype : float32
+            Shape : [1, 1, 1, 2048]
+    Output MetaInfo
+        Output: tvmgen_default_clml_main_1_layer_out_5
+            Dtype : float32
+            Shape : [1, 1000]
+
+Subgraph Name: tvmgen_default_clml_main_0
+    Input Count  : 1
+    Output Count : 1
+    Input MetaInfo
+        Input: tvmgen_default_clml_main_0_input_0
+            Dtype : float32
+            Shape : [1, 3, 230, 230]
+    Output MetaInfo
+        Output: tvmgen_default_clml_main_0_layer_out_406
+            Dtype : float32
+            Shape : [1, 2048, 1, 1]
+.....
+```
+
+The meta information above indicates that the ResNet50 model is partitioned
in such a way that there exist two OpenCLML subgraphs.
+
+Below command runs the models by setting the parameters from 
```clml_params.npz```.
+
+```bash
+Android:/data/local/tmp $ ./clml_run --params=./clml_params.npz
+Input         =
+Output        =
+Params        = ./clml_params.npz
+DumpMeta      = 1
+......
+CLMLRunner Loading Params:./clml_params.npz
+CLMLRunner Loading Params:./clml_params.npz
+CLMLRunner::Run :tvmgen_default_clml_main_1
+CLMLRunner::Run :tvmgen_default_clml_main_0
+......
+```
+
+Below command can set the model inputs from ```input.npz```  and can output 
sub graph outputs to ```output.npz```.
+```input.npz``` should have numpy arrays for 
```tvmgen_default_clml_main_1_input_0``` from sub graph 
```tvmgen_default_clml_main_1``` and ```tvmgen_default_clml_main_0_input_0``` 
from sub graph ```tvmgen_default_clml_main_0```.
+
+```bash
+Android:/data/local/tmp $ ./clml_run --params=./clml_params.npz 
--input=./input.npz --output=./output.npz                                       
                                <
+Input         = ./input.npz
+Output        = ./output.npz
+Params        = ./clml_params.npz
+DumpMeta      = 0
+Call Build Modules
+CLMLRunner Constructor: Input:./input.npz Output:./output.npz 
Params:./clml_params.npz
+CLML Target version:3
+CLMLRunner Loading Params:./clml_params.npz
+CLMLRunner Loading Inputs:./input.npz
+Set Input For:tvmgen_default_clml_main_1_input_0
+
+CLMLRunner Constructor: Input:./input.npz Output:./output.npz 
Params:./clml_params.npz
+CLML Target version:3
+CLMLRunner Loading Params:./clml_params.npz
+CLMLRunner Loading Inputs:./input.npz
+Set Input For:tvmgen_default_clml_main_0_input_0
+
+Loop Through the Modules
+CLMLRunner::Run :tvmgen_default_clml_main_1
+Saving Output:tvmgen_default_clml_main_1_layer_out_5
+CLMLRunner::Run :tvmgen_default_clml_main_0
+Saving Output:tvmgen_default_clml_main_0_layer_out_406
+......
+```
+
+The generated output file ```output.npz``` contains all the output from all 
sub modules.
+In this case it contains ```tvmgen_default_clml_main_1_layer_out_5``` for sub 
graph ```tvmgen_default_clml_main_1``` and 
```tvmgen_default_clml_main_0_layer_out_406``` for sub graph 
```tvmgen_default_clml_main_0``` as shown below.
+
+
+```bash
+Android:/data/local/tmp $ unzip -l output.npz
+Archive:  output.npz
+  Length      Date    Time    Name
+---------  ---------- -----   ----
+     4080  1980-00-00 00:00   tvmgen_default_clml_main_1_layer_out_5.npy
+     8272  1980-00-00 00:00   tvmgen_default_clml_main_0_layer_out_406.npy
+---------                     -------
+    12352                     2 files
+```
diff --git a/apps/cpp_clml/clml_runner.cc b/apps/cpp_clml/clml_runner.cc
new file mode 100644
index 0000000000..d733922da4
--- /dev/null
+++ b/apps/cpp_clml/clml_runner.cc
@@ -0,0 +1,818 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file clml_runner.cc
+ * \brief CLML model runner implementation.
+ */
+
+#include "clml_runner.h"
+
+#include <fstream>
+#include <iostream>
+#include <streambuf>
+#include <string>
+
+namespace tvm {
+namespace runtime {
+
+/*!
+ * \brief Constructor for CLMLRunner.
+ * \param name is unique name for the sub graph or this CLML Runner.
+ * \param args tool or utility arguments.
+ * \param arg_platform_id is the OpenCL platform.
+ * \param arg_context is the OpenCL context.
+ * \param arg_device_id is the OpenCL device_id.
+ * \param arg_queue is the OpenCL queue.
+ */
+CLMLRunner::CLMLRunner(std::string name, ToolArgs& args, cl_platform_id 
arg_platform_id,
+                       cl_context arg_context, cl_device_id arg_device_id,
+                       cl_command_queue arg_queue)
+    : r_args(args),
+      r_name(name),
+      platform(arg_platform_id),
+      context(arg_context),
+      device_id(arg_device_id),
+      queue(arg_queue) {
+  LOG(INFO) << "CLMLRunner Constructor: Input:" << r_args.input << " Output:" 
<< r_args.output
+            << " Params:" << r_args.params;
+  cl_int result;
+
+  // Query and Get CLML Interface
+  static const cl_uint MAX_VERSIONS = 256;
+  cl_int majorVersions[MAX_VERSIONS];
+  cl_int minorVersions[MAX_VERSIONS];
+  cl_uint numVersions = 0;
+  result = clQueryMLInterfaceVersionsQCOM(nullptr, nullptr, 0, &numVersions);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+  CLML_SDK_TEST_AND_EXIT(numVersions > 0u);
+  CLML_SDK_TEST_AND_EXIT(numVersions <= MAX_VERSIONS);
+
+  result = clQueryMLInterfaceVersionsQCOM(majorVersions, minorVersions, 
numVersions, nullptr);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  for (cl_uint i = 0; i < numVersions; ++i) {
+    if (majorVersions[i] == CL_QCOM_ML_OPS_H_MAJOR_VERSION) {
+      this->h_ClmlIntf = GET_ML_INTERFACE(0);
+      LOG(INFO) << "CLML Target version:" << majorVersions[i];
+      break;
+    }
+  }
+  CLML_SDK_TEST_AND_EXIT(this->h_ClmlIntf != nullptr);
+
+  result = h_ClmlIntf->clCreateMLTuningCacheQCOM(&tuning_cache);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  if (!r_args.params.empty()) {
+    LOG(INFO) << "CLMLRunner Loading Params:" << r_args.params;
+    npz_params = cnpy::npz_load(r_args.params);
+  } else {
+    LOG(INFO) << "CLMLRunner : No parameters supplied";
+  }
+
+  if (!r_args.input.empty()) {
+    LOG(INFO) << "CLMLRunner Loading Inputs:" << r_args.input;
+    npz_input = cnpy::npz_load(r_args.input);
+  } else {
+    LOG(INFO) << "CLMLRunner : No Input's given. Asuming a dry-run.";
+  }
+}
+
+/*!
+ * \brief Call one cycle of execution for the model.
+ * \return 0 on success else error code.
+ */
+int CLMLRunner::Run(void) {
+  LOG(INFO) << "CLMLRunner::Run :" << GetModName();
+  cl_int result;
+
+  for (size_t i = 0; i < this->function.size(); ++i) {
+    result = h_ClmlIntf->clEnqueueMLOpQCOM(queue, this->function[i], 
this->descriptorSet, 0,
+                                           nullptr, nullptr);
+    CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+  }
+  if (!r_args.output.empty()) {
+    for (auto it = this->outputs.begin(); it != this->outputs.end(); it++) {
+      auto out_name = it->first;
+      auto out_desc = it->second;
+      auto dtype = outputs_dtypes[out_name];
+      auto shape = outputs_shapes[out_name];
+      size_t size = 1;
+      for (auto si : shape) size *= si;
+      if (dtype == "float32") {
+        void* data = (void*)malloc(size * 4);
+        CopyDataFromCLMLTensor(out_desc, data);
+        LOG(INFO) << "Saving Output:" << out_name;
+        cnpy::npz_save<float>(r_args.output, out_name, (float*)data, shape, 
"a");
+        free(data);
+      } else if (dtype == "int8") {
+        void* data = (void*)malloc(size);
+        CopyDataFromCLMLTensor(out_desc, data);
+        LOG(INFO) << "Saving Output:" << out_name;
+        cnpy::npz_save<int8_t>(r_args.output, out_name, (int8_t*)data, shape, 
"a");
+        free(data);
+      } else {
+        LOG(WARNING) << "Unsupported dtype to dump :" << dtype;
+      }
+    }
+  }
+  return 0;
+}
+
+/*!
+ * \brief Set meta information.
+ * \param minfo is the meta information of the sub graph.
+ */
+void CLMLRunner::SetMetaInfo(std::string minfo) { this->meta_info = minfo; }
+
+/*!
+ * \brief Print the meta information.
+ */
+void CLMLRunner::PrintMetaInfo(void) { LOG(INFO) << "\n" << this->meta_info; }
+
+/*!
+ * \brief Copy the bytedata into tensor.
+ * \param tensor is tensor descriptor to copy data.
+ * \param data is pointer to bytedata.
+ * \param layout is source data layout
+ */
+void 
CLMLRunner::CopyDataToCLMLTensor(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
tensor,
+                                      void* data, cl_ml_tensor_layout_qcom 
layout) {
+  cl_int result = 0;
+  cl_event evt = nullptr;
+  result = h_ClmlIntf->clEnqueueWriteMLTensorDataQCOM(this->queue, data, 
layout, tensor->tensor,
+                                                      tensor->memory,
+                                                      0,        // n waitlist
+                                                      nullptr,  // waitlist
+                                                      &evt);    // event
+  CLML_SDK_TEST_AND_EXIT((evt != nullptr) && result == CL_SUCCESS);
+}
+
+/*!
+ * \brief Copy the bytedata into tensor.
+ * \param tensor is tensor descriptor to copy data.
+ * \param data is pointer to bytedata.
+ * \param layout is source data layout
+ */
+void 
CLMLRunner::CopyDataFromCLMLTensor(std::shared_ptr<cl_ml_tensor_memory_desc_qcom>
 tensor,
+                                        void* data, cl_ml_tensor_layout_qcom 
layout) {
+  cl_int result = 0;
+  cl_event readEvent = nullptr;
+  // Read the output tensor
+  result = h_ClmlIntf->clEnqueueReadMLTensorDataQCOM(this->queue, 
tensor->tensor, tensor->memory,
+                                                     data, layout,
+                                                     0,            // n 
waitlist
+                                                     nullptr,      // waitlist
+                                                     &readEvent);  // event
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+  result = clWaitForEvents(1, &readEvent);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+}
+
+/*!
+ * \brief Allocate backing memory for tensor descriptor.
+ * \param pTensorMemDesc is tensor descriptor.
+ * \return memory allocation status (CL_SUCCESS or error code).
+ */
+cl_int CLMLRunner::AllocateTensorMemory(
+    std::shared_ptr<cl_ml_tensor_memory_desc_qcom> pTensorMemDesc) {
+  uint32_t size = 0;
+  cl_int result = CL_OUT_OF_HOST_MEMORY;
+  cl_mem buffer = nullptr;
+
+  result = h_ClmlIntf->clGetMLTensorMemorySizeQCOM(context, 
pTensorMemDesc->tensor, &size);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, nullptr, &result);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  pTensorMemDesc->memory = buffer;
+
+  return result;
+}
+
+/*!
+ * \brief Allocate memory for all tensor descriptors in storage map.
+ * Also set data for tensors given params and input numpy dumps
+ */
+void CLMLRunner::AllocateMemAndPopulateParams(void) {
+  cl_int result;
+  for (auto it = this->storage_map.begin(); it != this->storage_map.end(); 
it++) {
+    auto node_id = it->first;
+    auto tensor_desc = it->second;
+
+    AllocateTensorMemory(tensor_desc);
+
+    if (npz_params.find(node_id) != npz_params.end()) {
+      CopyDataToCLMLTensor(tensor_desc, npz_params[node_id].data<char>());
+    }
+
+    if (npz_input.find(node_id) != npz_input.end()) {
+      LOG(INFO) << "Set Input For:" << node_id;
+      CopyDataToCLMLTensor(tensor_desc, npz_input[node_id].data<char>());
+    }
+
+    this->tensorMemDescs.push_back(*tensor_desc);
+  }
+  if (!r_args.dump_meta) {
+    // Cross check all params
+    for (auto nid : consts) {
+      if (npz_params.find(nid) == npz_params.end()) {
+        LOG(WARNING) << "Param not found in npz:" << nid;
+      }
+    }
+  }
+  // Initialize Tensor Descriptors
+  result = 
h_ClmlIntf->clCreateMLTensorMemoryDescriptorSetQCOM(&this->descriptorSet);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  result = h_ClmlIntf->clUpdateMLTensorMemoryDescriptorSetQCOM(
+      this->descriptorSet, static_cast<uint32_t>(this->tensorMemDescs.size()),
+      this->tensorMemDescs.data());
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+}
+
+/*!
+ * \brief Initializes an unused tensor.
+ * It is used across operators.
+ */
+void CLMLRunner::MakeUnusedTensor(void) {
+  cl_int result;
+  cl_ml_tensor_desc_qcom desc = {};
+  desc.num_dimensions = CL_TENSOR_UNUSED_QCOM;
+  this->unusedTensor = std::make_shared<cl_ml_tensor_memory_desc_qcom>();
+  result = this->h_ClmlIntf->clCreateMLTensorQCOM(this->context, nullptr, 
&desc,
+                                                  
&(this->unusedTensor->tensor));
+  CLML_SDK_TEST_AND_EXIT(this->unusedTensor && result == CL_SUCCESS);
+}
+
+/*!
+ * \brief Convert string datatype to cl channel type.
+ * \param dtype the datatype as string.
+ * \return cl channel type corresponding to the datatype.
+ */
+cl_channel_type MakeCLDataType(const std::string& dtype) {
+  if (dtype == "float32") {
+    return CL_FLOAT;
+  } else if (dtype == "float16") {
+    return CL_HALF_FLOAT;
+  } else {
+    LOG(FATAL) << "Datatype: " << dtype << " unsupported by CLML runtime";
+  }
+  return CL_FLOAT;
+}
+
+/*!
+ * \brief Map operator arithmetic mode based on data type and accumulation
type.
+ * \param data_type is cl channel type for computation.
+ * \param acc_type is cl channel type for accumulation.
+ * \return the arithmetic mode.
+ */
+cl_arithmetic_mode_qcom MakeCLArithMode(const cl_channel_type& data_type,
+                                        const cl_channel_type& acc_type = 
CL_FLOAT) {
+  if (data_type == CL_FLOAT && acc_type == CL_FLOAT) {
+    return CL_ARITHMETIC_MODE_FP32_QCOM;
+  } else if (data_type == CL_HALF_FLOAT && acc_type == CL_FLOAT) {
+    return CL_ARITHMETIC_MODE_FP16_ACC32_QCOM;
+  } else if (data_type == CL_HALF_FLOAT && acc_type == CL_HALF_FLOAT) {
+    return CL_ARITHMETIC_MODE_FP16_QCOM;
+  } else {
+    LOG(FATAL) << "Datatype " << data_type << " unsupported by CLML runtime";
+  }
+}
+
+/*!
+ * \brief Creates a tensor descriptor.
+ * \param shape is shape of tensor.
+ * \param dtype tensor data type as string.
+ * \param layout is the data layout to be used.
+ * \return newly created tensor descriptor.
+ */
+std::shared_ptr<cl_ml_tensor_memory_desc_qcom> CLMLRunner::MakeCLMLTensor(
+    std::vector<size_t> shape, std::string dtype, cl_ml_tensor_layout_qcom 
layout) {
+  cl_int result;
+  tensor_dims_t dims;
+  // Make sure the tensors with dimensions less than 4 are padded with 1.
+  shape.push_back(1);
+  shape.push_back(1);
+  shape.push_back(1);
+
+  dims.n = shape[0];
+  dims.c = shape[1];
+  dims.h = shape[2];
+  dims.w = shape[3];
+  cl_channel_type cl_dtype = MakeCLDataType(dtype);
+  auto tensor_dsc = std::make_shared<cl_ml_tensor_memory_desc_qcom>();
+  cl_ml_tensor_desc_qcom desc = {
+      cl_dtype, layout, dims.n, dims.c, dims.h, dims.w, 0, 
CL_TENSOR_DIMENSIONS_4D_QCOM, {0}};
+  result =
+      this->h_ClmlIntf->clCreateMLTensorQCOM(this->context, nullptr, &desc, 
&tensor_dsc->tensor);
+  CLML_SDK_TEST_AND_EXIT(tensor_dsc->tensor && result == CL_SUCCESS);
+  return tensor_dsc;
+}
+
+/*!
+ * \brief Convolution2D implementation.
+ * \param input_desc is input tensor descriptor.
+ * \param weight_desc is the kernel as tensor descriptor.
+ * \param bias_desc is bias as tensor descriptor.
+ * \param output_desc is the placeholder for convolution output.
+ * \param padding padding to be applied on input tensor.
+ * \param dilation is convolution dilation parameter.
+ * \param strides is convolution strides parameter.
+ * \param groups number of groups.
+ * \param mode is it normal convolution or depthwise convolution.
+ * \param activation activation to be applied on result.
+ * \param has_bias is bias tensor valid.
+ * \param has_act is activation to be applied.
+ * \param dtype operator data type.
+ */
+void CLMLRunner::MakeConv2D(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                            std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
weight_desc,
+                            std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
bias_desc,
+                            std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                            std::vector<cl_uint> padding, std::vector<cl_uint> 
dilation,
+                            std::vector<cl_uint> strides, int groups, 
cl_convolution_mode_qcom mode,
+                            cl_activation_function_qcom activation, bool 
has_bias, bool has_act,
+                            std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_int result;
+  if (CL_CONVOLUTION_MODE_CONVOLUTION_QCOM == mode) {
+    CLML_SDK_TEST_AND_EXIT(groups == 1);  // CLML convolution only supports 
group size of 1
+  } else {
+    groups = 1;  // Don't need to pass groups to depthwise
+  }
+  cl_ml_op_activation_desc_qcom act_desc = {activation, CL_PROPAGATE_NAN_QCOM, 
cl_arithmetic_mode};
+  cl_uint clml_padding_b[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {padding[0], 
padding[1]};
+  cl_uint clml_padding_a[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {padding[2], 
padding[3]};
+  cl_uint clml_strides[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {strides[0], 
strides[1]};
+  cl_uint clml_dilation[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {dilation[0], 
dilation[1]};
+
+  cl_ml_op_convolution_desc_qcom conv_desc{mode,
+                                           static_cast<cl_uint>(groups),
+                                           4,
+                                           {clml_padding_b[0], 
clml_padding_b[1]},
+                                           {clml_padding_a[0], 
clml_padding_a[1]},
+                                           {clml_strides[0], clml_strides[1]},
+                                           {clml_dilation[0], 
clml_dilation[1]},
+                                           0,
+                                           cl_arithmetic_mode};
+  cl_ml_op_qcom op = nullptr;
+  if (!has_act) {
+    result = h_ClmlIntf->clCreateMLOpConvolutionForwardQCOM(
+        this->context, 0, &conv_desc, input_desc->tensor, weight_desc->tensor, 
bias_desc->tensor,
+        output_desc->tensor, &op, tuning_cache);
+    CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  } else {
+    result = h_ClmlIntf->clCreateMLOpFusedConvolutionActivationForwardQCOM(
+        this->context, 0, &conv_desc, &act_desc, input_desc->tensor, 
weight_desc->tensor,
+        bias_desc->tensor, nullptr, output_desc->tensor, &op, tuning_cache);
+    CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  }
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief Fused Convolution2D+BatchNorm implementation.
+ * \param input_desc is input tensor descriptor.
+ * \param weight_desc is the kernel as tensor descriptor.
+ * \param bias_desc is bias as tensor descriptor.
+ * \param output_desc is the placeholder for convolution output.
+ * \param bn_scale fused batchnorm scale tensor descriptor.
+ * \param bn_bias fused batchnorm scale tensor descriptor.
+ * \param bn_mean fused batchnorm mean tensor descriptor.
+ * \param bn_var fused batchnorm variance tensor descriptor.
+ * \param bn_attrs batchnorm other attributes.
+ * \param padding padding to be applied on input tensor.
+ * \param dilation is convolution dilation parameter.
+ * \param strides is convolution strides parameter.
+ * \param groups number of groups.
+ * \param mode is it normal convolution or depthwise convolution.
+ * \param activation activation to be applied on result.
+ * \param has_bias is bias tensor valid.
+ * \param has_act is activation to be applied.
+ * \param dtype operator data type.
+ */
+void 
CLMLRunner::MakeConv2DWithBN(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> weight_desc,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bias_desc,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_scale,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_bias,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_mean,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_var,
+                                  std::vector<float> bn_attrs, 
std::vector<cl_uint> padding,
+                                  std::vector<cl_uint> dilation, 
std::vector<cl_uint> strides,
+                                  int groups, cl_convolution_mode_qcom mode,
+                                  cl_activation_function_qcom activation, bool 
has_bias,
+                                  bool has_act, std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_int result;
+  if (CL_CONVOLUTION_MODE_CONVOLUTION_QCOM == mode) {
+    CLML_SDK_TEST_AND_EXIT(groups == 1);  // CLML convolution only supports 
group size of 1
+  } else {
+    groups = 1;  // Don't need to pass groups to depthwise
+  }
+  cl_ml_op_activation_desc_qcom act_desc = {activation, CL_PROPAGATE_NAN_QCOM, 
cl_arithmetic_mode};
+  cl_uint clml_padding_b[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {padding[0], 
padding[1]};
+  cl_uint clml_padding_a[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {padding[2], 
padding[3]};
+  cl_uint clml_strides[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {strides[0], 
strides[1]};
+  cl_uint clml_dilation[CL_ML_TENSOR_MAX_SPATIAL_DIMS_QCOM] = {dilation[0], 
dilation[1]};
+
+  cl_ml_op_convolution_desc_qcom conv_desc{mode,
+                                           static_cast<cl_uint>(groups),
+                                           4,
+                                           {clml_padding_b[0], 
clml_padding_b[1]},
+                                           {clml_padding_a[0], 
clml_padding_a[1]},
+                                           {clml_strides[0], clml_strides[1]},
+                                           {clml_dilation[0], 
clml_dilation[1]},
+                                           0,
+                                           cl_arithmetic_mode};
+  cl_ml_op_qcom op = nullptr;
+  cl_ml_op_batchnorm_desc_qcom bn_desc = {CL_BATCHNORM_MODE_SPATIAL_QCOM, 
cl_arithmetic_mode};
+  if (!has_act) {
+    result = h_ClmlIntf->clCreateMLOpFusedConvolutionBatchNormForwardQCOM(
+        this->context, 0, &conv_desc, &bn_desc, input_desc->tensor, 
weight_desc->tensor,
+        bias_desc->tensor, output_desc->tensor, bn_mean->tensor, 
bn_var->tensor, bn_scale->tensor,
+        bn_bias->tensor, &op, tuning_cache);
+    CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  } else {
+    result = 
h_ClmlIntf->clCreateMLOpFusedConvolutionBatchNormActivationForwardQCOM(
+        this->context, 0, &conv_desc, &bn_desc, &act_desc, input_desc->tensor, 
weight_desc->tensor,
+        bias_desc->tensor, output_desc->tensor, nullptr, bn_mean->tensor, 
bn_var->tensor,
+        bn_scale->tensor, bn_bias->tensor, &op, tuning_cache);
+    CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  }
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief All types of ReLU(6) implementation.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param relu_type the type of ReLU activation.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeRelu(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                          std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                          cl_activation_function_qcom relu_type, std::string 
dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+  cl_ml_op_activation_desc_qcom act_desc = {relu_type, CL_PROPAGATE_NAN_QCOM, 
cl_arithmetic_mode};
+
+  result = h_ClmlIntf->clCreateMLOpActivationForwardQCOM(
+      this->context, 0, &act_desc, input_desc->tensor, 
this->unusedTensor->tensor,
+      output_desc->tensor, &op, tuning_cache);
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief Batch Normalization operator implementation.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param bn_scale fused batchnorm scale tensor descriptor.
+ * \param bn_bias fused batchnorm scale tensor descriptor.
+ * \param bn_mean fused batchnorm mean tensor descriptor.
+ * \param bn_var fused batchnorm variance tensor descriptor.
+ * \param bn_attrs batchnorm other attributes.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeBatchNorm(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                               std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                               std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
bn_scale,
+                               std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
bn_bias,
+                               std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
bn_mean,
+                               std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
bn_var,
+                               std::vector<float> bn_attrs, std::string dtype) 
{
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_ml_op_batchnorm_desc_qcom bn_desc = {CL_BATCHNORM_MODE_SPATIAL_QCOM, 
cl_arithmetic_mode};
+
+  result = h_ClmlIntf->clCreateMLOpBatchNormForwardQCOM(
+      this->context, 0, &bn_desc, input_desc->tensor, bn_mean->tensor, 
bn_var->tensor,
+      bn_scale->tensor, bn_bias->tensor, output_desc->tensor, &op, 
tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief All types of Pool2D operator implementation.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param pool_size pooling window size.
+ * \param strides stride for pooling.
+ * \param padding is the input padding.
+ * \param pool_type is type of poling (max, avg ...etc).
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakePool2D(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                            std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                            std::vector<cl_uint> pool_size, 
std::vector<cl_uint> strides,
+                            std::vector<cl_uint> padding, std::string 
pool_type,
+                            std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_ml_op_pooling_desc_qcom pool_desc = {
+      pool_type == "nn.max_pool2d" ? CL_POOLING_MODE_MAX_QCOM
+                                   : 
CL_POOLING_MODE_AVERAGE_EXCLUDE_PADDING_QCOM,
+      4,  // reserved
+      {padding[0], padding[1]},
+      {padding[2], padding[3]},
+      {strides[0], strides[1]},
+      {pool_size[0], pool_size[1]},
+      CL_PROPAGATE_NAN_QCOM,
+      cl_arithmetic_mode,
+  };
+
+  result = h_ClmlIntf->clCreateMLOpPoolingForwardQCOM(
+      this->context, 0, &pool_desc, input_desc->tensor, 
this->unusedTensor->tensor,
+      output_desc->tensor, &op, tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief All types of Global Pooling 2D operator implementation.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param in_shape is the input tensor shape.
+ * \param pool_type is the pool type (max or avg).
+ * \param dtype operator datatype.
+ */
+void 
CLMLRunner::MakeGlobalPool2D(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
+                                  std::vector<cl_uint> in_shape, std::string 
pool_type,
+                                  std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+  cl_ml_op_pooling_desc_qcom pool_desc = {
+      pool_type == "nn.global_max_pool2d" ? CL_POOLING_MODE_MAX_QCOM
+                                          : 
CL_POOLING_MODE_AVERAGE_EXCLUDE_PADDING_QCOM,
+      4,  // reserved
+      {0, 0},
+      {0, 0},
+      {1, 1},
+      {in_shape[2], in_shape[3]},
+      CL_PROPAGATE_NAN_QCOM,
+      cl_arithmetic_mode,
+  };
+
+  result = h_ClmlIntf->clCreateMLOpPoolingForwardQCOM(
+      this->context, 0, &pool_desc, input_desc->tensor, 
this->unusedTensor->tensor,
+      output_desc->tensor, &op, tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief Reshape Operator.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeReshape(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                             std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                             std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  result = h_ClmlIntf->clCreateMLOpReshapeQCOM(this->context, 0, 
input_desc->tensor,
+                                               output_desc->tensor, &op, 
tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief Concatenate operator implementation.
+ * \param in_list list of input tensor descriptors to concatenate.
+ * \param output_desc output tensor descriptor.
+ * \param axis is the dimention on which we join the tensors.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeConcatenate(
+    std::vector<std::shared_ptr<cl_ml_tensor_memory_desc_qcom>> in_list,
+    std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc, int axis, 
std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_ml_tensor_qcom* concatInputs = new cl_ml_tensor_qcom[in_list.size()];
+  for (int i = 0; i < in_list.size(); i++) {
+    concatInputs[i] = in_list[i]->tensor;
+  }
+  cl_ml_op_concat_desc_qcom concatDesc = {1, (cl_uint)in_list.size(), 
cl_arithmetic_mode};
+  result = h_ClmlIntf->clCreateMLOpConcatQCOM(this->context, 0, &concatDesc, 
concatInputs,
+                                              output_desc->tensor, &op, 
tuning_cache);
+  delete[] concatInputs;
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief Dense operator implementation.
+ * \param input_desc input tensor descriptor.
+ * \param weight_desc weight tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param bias_desc bias tensor descriptor.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeDense(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                           std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
weight_desc,
+                           std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                           std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
bias_desc,
+                           std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_ml_op_convolution_desc_qcom conv_desc = 
{CL_CONVOLUTION_MODE_CONVOLUTION_QCOM,
+                                              1,
+                                              4,
+                                              {0, 0},
+                                              {0, 0},
+                                              {1, 1},
+                                              {1, 1},
+                                              0,
+                                              cl_arithmetic_mode};
+
+  result = h_ClmlIntf->clCreateMLOpConvolutionForwardQCOM(
+      this->context, 0, &conv_desc, input_desc->tensor, weight_desc->tensor, 
bias_desc->tensor,
+      output_desc->tensor, &op, tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief SoftMax operator implementation.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeSoftMax(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                             std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                             std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_ml_op_softmax_desc_qcom softmax_desc = 
{CL_SOFTMAX_ALGORITHM_ACCURATE_QCOM,
+                                             CL_SOFTMAX_MODE_INSTANCE_QCOM, 
cl_arithmetic_mode};
+
+  result = h_ClmlIntf->clCreateMLOpSoftmaxQCOM(this->context, 0, 
&softmax_desc, input_desc->tensor,
+                                               output_desc->tensor, &op, 
tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief .
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param pad_mode type of padding to be applied (constant, edge, reflect 
...etc).
+ * \param padding amount of padding.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakePad(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                         std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                         std::string pad_mode, std::vector<cl_uint> padding, 
std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_pad_mode_qcom clml_pad_mode = CL_PAD_MODE_CONSTANT_QCOM;
+  if (pad_mode == "constant")
+    clml_pad_mode = CL_PAD_MODE_CONSTANT_QCOM;
+  else if (pad_mode == "edge")
+    clml_pad_mode = CL_PAD_MODE_SYMMETRIC_QCOM;
+  else if (pad_mode == "reflect")
+    clml_pad_mode = CL_PAD_MODE_REFLECT_QCOM;
+  else
+    LOG(FATAL) << "Padding mode not supported by CLML:" << pad_mode;
+
+  cl_ml_op_pad_desc_qcom pad_desc{clml_pad_mode,
+                                  {0, 0},
+                                  {padding[0], padding[1], padding[2], 
padding[3], 0, 0, 0, 0},
+                                  cl_arithmetic_mode};
+
+  result = h_ClmlIntf->clCreateMLOpPadQCOM(this->context, 0, &pad_desc, 
input_desc->tensor,
+                                           output_desc->tensor, &op, 
tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief Batch Flatten operator implementation.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param dtype operator datatype.
+ */
+void 
CLMLRunner::MakeBatchFlatten(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                                  
std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
+                                  std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  result = h_ClmlIntf->clCreateMLOpReshapeQCOM(this->context, 0, 
input_desc->tensor,
+                                               output_desc->tensor, &op, 
tuning_cache);
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief Clip operator.
+ * \param input_desc input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param a_max is the upper bound to clip.
+ * \param a_min is the lower bound to clip.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeClip(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_desc,
+                          std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc, float a_max,
+                          float a_min, std::string dtype) {
+  LOG(INFO) << "MakeClip called";
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_ml_op_clip_desc_qcom clip_desc = {
+      CL_CLIP_BY_VALUE_QCOM, {{a_max}, CL_FLOAT}, {{a_min}, CL_FLOAT}, 
cl_arithmetic_mode};
+
+  result = h_ClmlIntf->clCreateMLOpClipQCOM(this->context, 0, &clip_desc, 
input_desc->tensor,
+                                            output_desc->tensor, &op, 
tuning_cache);
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+/*!
+ * \brief All types of Binary operators.
+ * \param input_a first input tensor descriptor.
+ * \param input_b second input tensor descriptor.
+ * \param output_desc output tensor descriptor.
+ * \param op_name is the binary operator.
+ * \param dtype operator datatype.
+ */
+void CLMLRunner::MakeBinaryOp(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_a,
+                              std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
input_b,
+                              std::shared_ptr<cl_ml_tensor_memory_desc_qcom> 
output_desc,
+                              std::string op_name, std::string dtype) {
+  cl_arithmetic_mode_qcom cl_arithmetic_mode = 
MakeCLArithMode(MakeCLDataType(dtype));
+  cl_ml_op_qcom op = nullptr;
+  cl_int result;
+
+  cl_binary_op_qcom binary_op = CL_TENSOR_OP_ADD_QCOM;
+  if (op_name == "subtract")
+    binary_op = CL_TENSOR_OP_SUB_QCOM;
+  else if (op_name == "multiply")
+    binary_op = CL_TENSOR_OP_MUL_QCOM;
+  else if (op_name == "divide")
+    binary_op = CL_TENSOR_OP_DIV_QCOM;
+  else if (op_name == "minimum")
+    binary_op = CL_TENSOR_OP_MIN_QCOM;
+  else if (op_name == "maximum")
+    binary_op = CL_TENSOR_OP_MAX_QCOM;
+  cl_ml_op_binary_desc_qcom add_desc = {
+      binary_op, {{1.0}, CL_FLOAT}, {{1.0}, CL_FLOAT}, {{0.0}, CL_FLOAT}, 
cl_arithmetic_mode};
+
+  result =
+      h_ClmlIntf->clCreateMLOpBinaryQCOM(this->context, 0, &add_desc, 
input_a->tensor,
+                                         input_b->tensor, output_desc->tensor, 
&op, tuning_cache);
+
+  CLML_SDK_TEST_AND_EXIT(op && result == CL_SUCCESS);
+  this->function.push_back(op);
+}
+
+}  // namespace runtime
+}  // namespace tvm
diff --git a/apps/cpp_clml/clml_runner.h b/apps/cpp_clml/clml_runner.h
new file mode 100644
index 0000000000..4e73674d72
--- /dev/null
+++ b/apps/cpp_clml/clml_runner.h
@@ -0,0 +1,262 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file clml_runner.h
+ * \brief CLML model runner.
+ */
+#ifndef CLML_APPS_CPP_RCLML_RUNNER_H_
+#define CLML_APPS_CPP_RCLML_RUNNER_H_
+
+#include <csignal>
+#include <cstdio>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#if defined(__linux__) || defined(__ANDROID__)
+#include <unistd.h>
+#endif
+
+#include <CL/cl_qcom_ml_ops.h>
+#include <cnpy.h>
+#include <dmlc/io.h>
+
+#include "CL/cl.h"
+
+#define CLML_SDK_TEST_AND_EXIT(expression)                                     
                 \
+  {                                                                            
                 \
+    {                                                                          
                 \
+      int _n_ = !(expression);                                                 
                 \
+      if (_n_) {                                                               
                 \
+        fprintf(stderr, "Error on line %d of %s\nFailing expression: %s\n", 
__LINE__, __FILE__, \
+                #expression);                                                  
                 \
+        exit(1);                                                               
                 \
+      }                                                                        
                 \
+    }                                                                          
                 \
+  }
+
+#define CAT_I(a, b) a##b
+#define CAT(a, b) CAT_I(a, b)
+#define GET_ML_INTERFACE CAT(CAT(clGetMLInterfaceV, 
CL_QCOM_ML_OPS_H_MAJOR_VERSION), QCOM)
+#define GET_ML_API_INTERFACE CAT(CAT(CLMLInterfaceV, 
CL_QCOM_ML_OPS_H_MAJOR_VERSION), QCOM)
+
+namespace tvm {
+namespace runtime {
+
/**
 * \brief Tensor dimensions in NCHW order: batch (n), channel (c), height (h), width (w).
 */
struct tensor_dims_t {
  uint32_t n, c, h, w;  // batch, channels, height, width
};
+
/*!
 * \brief Tool Arguments.
 * \arg input Numpy file for the model input
 * \arg output Numpy file name to dump the model output as numpy
 * \arg params Numpy file holding the params for models
 */
struct ToolArgs {
  std::string input;        // path to .npz with model inputs (random data used if empty)
  std::string output;       // path where model outputs are dumped as .npz
  std::string params;       // path to .npz holding model parameters
  bool dump_meta = false;   // when true, print per-subgraph meta information
};
+
/*!
 * \brief Encapsulates CLML Runner functionality for one compiled sub graph.
 *
 * A CLMLRunner owns the list of CLML ops built for a sub graph, the tensor
 * descriptors they reference, and the OpenCL handles needed to execute them.
 */
class CLMLRunner {
 public:
  /*! \brief Constructor */
  CLMLRunner(std::string name, ToolArgs& args, cl_platform_id arg_platform_id,
             cl_context arg_context, cl_device_id arg_device_id, cl_command_queue arg_queue);

  /*! \brief Returns the name for this sub graph */
  std::string GetModName(void) { return r_name; }
  /*! \brief Executes one cycle of all CLML ops */
  int Run(void);
  /*! \brief Sets meta information (as produced by the CLML codegen) */
  void SetMetaInfo(std::string minfo);
  /*! \brief Print function to show all meta information */
  void PrintMetaInfo(void);
  /*! \brief Initializes the unusedTensor placeholder shared by several ops */
  void MakeUnusedTensor(void);
  /*! \brief Copy given bytestream of data to the tensor */
  void CopyDataToCLMLTensor(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> tensor, void* data,
                            cl_ml_tensor_layout_qcom layout = CL_TENSOR_LAYOUT_NCHW_QCOM);
  /*! \brief Copy tensor data to data in expected layout format */
  void CopyDataFromCLMLTensor(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> tensor, void* data,
                              cl_ml_tensor_layout_qcom layout = CL_TENSOR_LAYOUT_NCHW_QCOM);
  /*! \brief Allocates memory for the tensor descriptor */
  cl_int AllocateTensorMemory(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> pTensorMemDesc);
  /*!
   * \brief Allocates memory for all tensor descriptors in the storage map.
   * Also initializes the parameter nodes and inputs from given numpy dumps if provided.
   */
  void AllocateMemAndPopulateParams(void);
  /*! \brief Create a tensor descriptor given its shape, dtype and layout */
  std::shared_ptr<cl_ml_tensor_memory_desc_qcom> MakeCLMLTensor(
      std::vector<size_t> shape, std::string dtype = "float32",
      cl_ml_tensor_layout_qcom layout = CL_TENSOR_LAYOUT_OPTIMAL_QCOM);
  /*! \brief Conv2D layer implementation */
  void MakeConv2D(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                  std::shared_ptr<cl_ml_tensor_memory_desc_qcom> weight_desc,
                  std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bias_desc,
                  std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                  std::vector<cl_uint> padding, std::vector<cl_uint> dilation,
                  std::vector<cl_uint> strides, int groups, cl_convolution_mode_qcom mode,
                  cl_activation_function_qcom activation, bool has_bias, bool has_act,
                  std::string dtype);

  /*! \brief Conv2D with fused BatchNorm layer implementation */
  void MakeConv2DWithBN(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> weight_desc,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bias_desc,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_scale,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_bias,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_mean,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_var,
                        std::vector<float> bn_attrs, std::vector<cl_uint> padding,
                        std::vector<cl_uint> dilation, std::vector<cl_uint> strides, int groups,
                        cl_convolution_mode_qcom mode, cl_activation_function_qcom activation,
                        bool has_bias, bool has_act, std::string dtype);

  /*! \brief ReLU layer implementation */
  void MakeRelu(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                cl_activation_function_qcom relu_type, std::string dtype);

  /*! \brief Batch Normalization layer implementation */
  void MakeBatchNorm(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                     std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                     std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_scale,
                     std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_bias,
                     std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_mean,
                     std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bn_var,
                     std::vector<float> bn_attrs, std::string dtype);

  /*! \brief Pool2D (with all variants) layer implementation */
  void MakePool2D(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                  std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                  std::vector<cl_uint> pool_size, std::vector<cl_uint> strides,
                  std::vector<cl_uint> padding, std::string pool_type, std::string dtype);

  /*! \brief GlobalPool2D (with all variants) layer implementation */
  void MakeGlobalPool2D(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                        std::vector<cl_uint> in_shape, std::string pool_type, std::string dtype);

  /*! \brief Reshape layer implementation */
  void MakeReshape(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                   std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc, std::string dtype);

  /*! \brief Concatenate layer implementation */
  void MakeConcatenate(std::vector<std::shared_ptr<cl_ml_tensor_memory_desc_qcom>> in_list,
                       std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc, int axis,
                       std::string dtype);

  /*! \brief Dense layer implementation */
  void MakeDense(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                 std::shared_ptr<cl_ml_tensor_memory_desc_qcom> weight_desc,
                 std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                 std::shared_ptr<cl_ml_tensor_memory_desc_qcom> bias_desc, std::string dtype);

  /*! \brief SoftMax layer implementation */
  void MakeSoftMax(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                   std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc, std::string dtype);

  /*! \brief Pad layer implementation */
  void MakePad(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
               std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc, std::string pad_mode,
               std::vector<cl_uint> padding, std::string dtype);

  /*! \brief Batch Flatten layer implementation */
  void MakeBatchFlatten(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                        std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc,
                        std::string dtype);

  /*! \brief Clip layer implementation */
  void MakeClip(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_desc,
                std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc, float a_max,
                float a_min, std::string dtype);

  /*! \brief Binary Operator (with all types) layer implementation */
  void MakeBinaryOp(std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_a,
                    std::shared_ptr<cl_ml_tensor_memory_desc_qcom> input_b,
                    std::shared_ptr<cl_ml_tensor_memory_desc_qcom> output_desc, std::string op_name,
                    std::string dtype);

  /*! \brief Vector of created operators, executed in order by Run() */
  std::vector<cl_ml_op_qcom> function;
  /*! \brief Vector of graph's input tensor descriptors */
  std::vector<std::shared_ptr<cl_ml_tensor_memory_desc_qcom>> inputs;
  /*! \brief Map of graph's output tensor descriptors with names */
  std::map<std::string, std::shared_ptr<cl_ml_tensor_memory_desc_qcom>> outputs;
  /*! \brief Map of graph's output tensor names and dtypes */
  std::map<std::string, std::string> outputs_dtypes;
  /*! \brief Map of graph's output tensor names and shapes */
  std::map<std::string, std::vector<size_t>> outputs_shapes;
  /*! \brief Overall storage map for all tensor descriptors involved */
  std::map<std::string, std::shared_ptr<cl_ml_tensor_memory_desc_qcom>> storage_map;
  /*! \brief List of const tensors (parameter names) of the graph */
  std::vector<std::string> consts;
  /*! \brief List of all memory descriptors in graph */
  std::vector<cl_ml_tensor_memory_desc_qcom> tensorMemDescs;
  /*! \brief Tensor memory descriptor set */
  cl_ml_tensor_mem_desc_set_qcom descriptorSet;
  /*! \brief Placeholder tensor passed to ops whose optional tensor slot is unused */
  std::shared_ptr<cl_ml_tensor_memory_desc_qcom> unusedTensor;

  /*! \brief ML API interface */
  GET_ML_API_INTERFACE* h_ClmlIntf = nullptr;
  /*! \brief Tuning cache object */
  cl_ml_tuningcache_qcom tuning_cache = nullptr;
  /*! \brief Flag to indicate a tuning run */
  bool is_tuning_run;
  /*! \brief The tuning file for loading or storing cache */
  char* tuning_file;

  /*! \brief OpenCL platform */
  cl_platform_id platform{nullptr};
  /*! \brief OpenCL context */
  cl_context context{nullptr};
  /*! \brief OpenCL device */
  cl_device_id device_id{nullptr};
  /*! \brief OpenCL Queue */
  cl_command_queue queue{nullptr};
  /*! \brief Numpy object for params */
  cnpy::npz_t npz_params;
  /*! \brief Numpy object for inputs */
  cnpy::npz_t npz_input;

 private:
  /*! \brief Unique name for the runner (sub graph symbol) */
  std::string r_name;
  /*! \brief Tool arguments this runner was created with */
  ToolArgs r_args;
  /*! \brief Holds meta information from clml codegen */
  std::string meta_info;
};
+
+}  // namespace runtime
+}  // namespace tvm
+#endif  // CLML_APPS_CPP_RCLML_RUNNER_H_
diff --git a/apps/cpp_clml/main.cc b/apps/cpp_clml/main.cc
new file mode 100644
index 0000000000..b918618a17
--- /dev/null
+++ b/apps/cpp_clml/main.cc
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file main.cc
+ * \brief CLML Model execution application.
+ */
+
+#include "clml_runner.h"
+
+using namespace tvm::runtime;
+
+/*!
+ * \brief Auto generated model file (clml_models.cc) entry function definition.
+ * \param args The tool arguments to forward
+ * \param arg_platform OpenCL platform
+ * \param arg_context OpenCL context
+ * \param arg_device_id OpenCL device id
+ * \param queue OpenCL queue
+ * \return List of CLMLRunner objects corresponding to all sub graphs of a TVM 
module.
+ */
+std::vector<CLMLRunner> BuildModules(ToolArgs& args, cl_platform_id 
arg_platform,
+                                     cl_context arg_context, cl_device_id 
arg_device_id,
+                                     cl_command_queue queue);
+
// Usage text printed when the tool is invoked without arguments or fails.
// Fixes: "random of not given" -> "random if not given", and the example
// params file name now matches the actual "clml_params.npz" used elsewhere.
static const std::string kUsage =
    "Command line usage\n"
    "--input        - Numpy file for the model input (optional and we use random if not given)\n"
    "--output       - Numpy file name to dump the model output as numpy\n"
    "--params       - Numpy file with params\n"
    "--dump-meta    - Dump model meta information\n"
    "\n"
    "  Example\n"
    "  ./clml_run --dump-meta\n"
    "  ./clml_run --params=clml_params.npz\n"
    "  ./clml_run --input=input.npz --output=output.npz --params=clml_params.npz\n"
    "\n";
+
/*!
 * \brief PrintArgs print the contents of ToolArgs
 * \param args ToolArgs structure
 */
void PrintArgs(const ToolArgs& args) {
  // Echo the parsed command line so each run is self-describing in the log.
  LOG(INFO) << "Input         = " << args.input;
  LOG(INFO) << "Output        = " << args.output;
  LOG(INFO) << "Params        = " << args.params;
  LOG(INFO) << "DumpMeta      = " << args.dump_meta;
}
+
#if defined(__linux__) || defined(__ANDROID__)
/*!
 * \brief CtrlCHandler, exits if Ctrl+C is pressed
 * \param s signal
 */
void CtrlCHandler(int s) {
  // NOTE(review): exit(1) skips stack unwinding, so OpenCL/CLML resources are
  // released by the OS rather than by destructors — acceptable for a CLI tool.
  LOG(INFO) << "User pressed Ctrl+C, Exiting";
  exit(1);
}

/*!
 * \brief HandleCtrlC Register for handling Ctrl+C event.
 */
void HandleCtrlC() {
  // Install CtrlCHandler for SIGINT with an empty signal mask and no flags.
  struct sigaction sigIntHandler;
  sigIntHandler.sa_handler = CtrlCHandler;
  sigemptyset(&sigIntHandler.sa_mask);
  sigIntHandler.sa_flags = 0;
  sigaction(SIGINT, &sigIntHandler, nullptr);
}
#endif
/*!
 * \brief GetCmdOption Parse and find the command option.
 * \param argc arg counter
 * \param argv arg values
 * \param option command line option to search for.
 * \param key whether the option itself is key
 * \return value corresponding to option (empty string when not found).
 */
std::string GetCmdOption(int argc, char* argv[], std::string option, bool key = false) {
  for (int idx = 1; idx < argc; ++idx) {
    const std::string arg = argv[idx];
    if (arg.rfind(option, 0) != 0) {
      continue;  // argument does not start with this option
    }
    if (key) {
      return arg;  // flag-style option: the token itself is the result
    }
    // Value-style option: everything after the first '=' is the value.
    return arg.substr(arg.find('=') + 1);
  }
  return "";
}
+
+/*!
+ * \brief ParseCmdArgs parses the command line arguments.
+ * \param argc arg counter
+ * \param argv arg values
+ * \param args the output structure which holds the parsed values
+ */
+void ParseCmdArgs(int argc, char* argv[], struct ToolArgs& args) {
+  const std::string input = GetCmdOption(argc, argv, "--input=");
+  if (!input.empty()) {
+    args.input = input;
+  }
+
+  const std::string output = GetCmdOption(argc, argv, "--output=");
+  if (!output.empty()) {
+    args.output = output;
+  }
+
+  const std::string params = GetCmdOption(argc, argv, "--params=");
+  if (!params.empty()) {
+    args.params = params;
+  }
+
+  const std::string pmeta = GetCmdOption(argc, argv, "--dump-meta", true);
+  if (!pmeta.empty()) {
+    args.dump_meta = true;
+  }
+}
+
+/*!
+ * \brief Check CLML extension availability in the CL device.
+ * \param platform_id OpenCL platform
+ * \param device_id OpenCL device id
+ * \return true if extension present else false.
+ */
+bool ExtensionStringPresent(cl_platform_id platform_id, cl_device_id 
device_id) {
+  cl_int result = 0;
+  size_t reqd_size = 0;
+  result = clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, 0, nullptr, 
&reqd_size);
+  CLML_SDK_TEST_AND_EXIT(reqd_size > 0u && result == CL_SUCCESS);
+
+  std::vector<char> buf(reqd_size);
+  result = clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, reqd_size, 
buf.data(), nullptr);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  std::string extensions(buf.data());
+  LOG(WARNING) << "OpenCL Extensions:" << extensions;
+  return (extensions.find("cl_qcom_ml_ops") != std::string::npos);
+}
+
+/*!
+ * \brief Loads and Executes the model on given Target.
+ * \param args tool arguments
+ * \return result of operation.
+ */
+int ExecuteModel(ToolArgs& args) {
+#if defined(__linux__) || defined(__ANDROID__)
+  // Ctrl+C handler
+  HandleCtrlC();
+#endif
+
+  // Init OpenCL Environment
+  cl_int result;
+  cl_event readEvent = nullptr;
+  cl_platform_id platform = nullptr;
+  cl_context context = nullptr;
+  cl_device_id device_id = nullptr;
+  cl_command_queue queue = nullptr;
+
+  // Initialize Context and Command Queue
+  result = clGetPlatformIDs(1, &platform, nullptr);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  uint32_t num_devices = 0;
+  result = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, nullptr, 
&num_devices);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS && num_devices == 1);
+
+  result = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device_id, 
nullptr);
+  CLML_SDK_TEST_AND_EXIT(device_id && result == CL_SUCCESS);
+
+  CLML_SDK_TEST_AND_EXIT(ExtensionStringPresent(platform, device_id) == true);
+
+  context = clCreateContext(0, 1, &device_id, nullptr, nullptr, &result);
+  CLML_SDK_TEST_AND_EXIT(result == CL_SUCCESS);
+
+  cl_command_queue_properties queue_props = 0;
+
+  queue = clCreateCommandQueue(context, device_id, queue_props, &result);
+  CLML_SDK_TEST_AND_EXIT(queue && result == CL_SUCCESS);
+
+  // Populate the runner with model
+  LOG(INFO) << "Call Build Modules\n";
+  auto runners = BuildModules(args, platform, context, device_id, queue);
+
+  LOG(INFO) << "Loop Through the Modules";
+  for (auto runner : runners) {
+    if (args.dump_meta) {
+      // Print Meta Information
+      runner.PrintMetaInfo();
+    }
+
+    // Run the model
+    runner.Run();
+  }
+
+  return 0;
+}
+
+/*!
+ * \brief main The main function.
+ * \param argc arg counter
+ * \param argv arg values
+ * \return result of operation.
+ */
+int main(int argc, char* argv[]) {
+  if (argc <= 1) {
+    LOG(INFO) << kUsage;
+    return 0;
+  }
+
+  ToolArgs args;
+  ParseCmdArgs(argc, argv, args);
+  PrintArgs(args);
+
+  if (ExecuteModel(args)) {
+    PrintArgs(args);
+    LOG(INFO) << kUsage;
+    return -1;
+  }
+  return 0;
+}
diff --git a/apps/cpp_clml/scripts/clml_codegen.py 
b/apps/cpp_clml/scripts/clml_codegen.py
new file mode 100644
index 0000000000..32e5782db3
--- /dev/null
+++ b/apps/cpp_clml/scripts/clml_codegen.py
@@ -0,0 +1,64 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import os
+import sys
+import numpy as np
+
+import tvm
+from tvm import relay
+from tvm.driver import tvmc
+from tvm.relay.op.contrib import clml
+from tvm.contrib import utils
+from string import Template
+
+
def main():
    """Generate native CLML source for a DNN model.

    Loads the model via tvmc, converts it to NCHW layout, partitions it for
    CLML and compiles it. The CLML constant params are saved to
    ``clml_params.npz`` and the generated CLML API source for all sub graphs
    is written to ``../clml_models.cc`` (then formatted with clang-format).
    """
    print("CLML Codegen")
    if len(sys.argv) != 2:
        print("Usage: python clml_codegen.py <model_path>")
        return

    tvmc_model = tvmc.load(sys.argv[1])
    mod = tvmc_model.mod
    params = tvmc_model.params
    # CLML codegen assumes NCHW layout; convert before partitioning.
    with tvm.transform.PassContext(opt_level=3):
        mod = tvmc.transform.convert_graph_layout(mod, "NCHW")
    with tvm.transform.PassContext(opt_level=3, disabled_pass=["AlterOpLayout"]):
        clml_mod = clml.partition_for_clml(mod, params)
        libm = relay.build(
            clml_mod,
            target="opencl -device=adreno",
            target_host="llvm -mtriple=aarch64-linux-gnu",
            params=params,
        )

        # Extract CLML related params
        (clml_params_save, gen_src) = clml.CLMLGenSrc(libm).get_artifacts()
        np.savez("clml_params.npz", **clml_params_save)

        # Use a context manager so the file is flushed and closed before
        # clang-format reads it (the original left the handle open only by
        # luck of CPython refcounting).
        with open("../clml_models.cc", "w") as f_src:
            f_src.write("\n".join(gen_src))
        # Read the pipe to block until clang-format finishes; a bare
        # os.popen() is fire-and-forget and may be killed at interpreter exit.
        os.popen("clang-format-10 -i ../clml_models.cc").read()


if __name__ == "__main__":
    main()
diff --git a/cmake/modules/contrib/CLML.cmake b/cmake/modules/contrib/CLML.cmake
index 811b8f8d58..c388e85b14 100644
--- a/cmake/modules/contrib/CLML.cmake
+++ b/cmake/modules/contrib/CLML.cmake
@@ -54,7 +54,7 @@ if(USE_CLML_GRAPH_EXECUTOR)
     include_directories(${CLML_INCLUDE_DIRS})
     find_library(EXTERN_CLML_COMPUTE_LIB
           NAMES OpenCL libOpenCL
-          HINTS "${CLML_PATH}" "${CLML_PATH}/lib64"
+          HINTS "${CLML_PATH}" "${CLML_PATH}/lib64" "${CLML_PATH}/lib"
           )
     list(APPEND TVM_RUNTIME_LINKER_LIBS ${EXTERN_CLML_COMPUTE_LIB})
     list(APPEND RUNTIME_SRCS ${CLML_CONTRIB_SRC})
diff --git a/docker/Dockerfile.ci_adreno b/docker/Dockerfile.ci_adreno
index 2f609a69c4..8f4ede3a5e 100644
--- a/docker/Dockerfile.ci_adreno
+++ b/docker/Dockerfile.ci_adreno
@@ -27,3 +27,6 @@ ENV ANDROID_HOME=/opt/android-sdk-linux
 ENV ANDROID_NDK_HOME=/opt/android-sdk-linux/ndk/21.3.6528147
 ENV ANDROID_NDK_MAJOR=21
 ENV PATH /opt/android-sdk-linux/platform-tools:$PATH
+
+# Clang tool for CLML source codegen
+RUN apt-get update && apt-install-and-clear -y clang-format-10
diff --git a/python/tvm/relay/op/contrib/clml.py 
b/python/tvm/relay/op/contrib/clml.py
index ec8cbb6320..1b504ac033 100644
--- a/python/tvm/relay/op/contrib/clml.py
+++ b/python/tvm/relay/op/contrib/clml.py
@@ -16,6 +16,8 @@
 # under the License.
 # pylint: disable=invalid-name, unused-argument
 """CLML Library supported operators."""
+import json
+from string import Template
 import tvm
 
 from tvm import relay
@@ -308,6 +310,10 @@ def clml_pattern_table():
         call = extract
         if len(call.attrs["pad_width"]) != 4:
             return False
+        # CLML can't process tensor padding without knowing the layout.
+        # Pad layers before any convolution are not guaranteed to be NCHW.
+        if isinstance(call.args[0], tvm.relay.expr.Var):
+            return False
         return True
 
     def check_softmax_op(extract):
@@ -401,3 +407,769 @@ class OpAttrContext(object):
         self.op.reset_attr(self.attr_key)
         if self.older_attr:
             self.op.set_attr(self.attr_key, self.older_attr)
+
+
class CLMLGetSubModuleSrc:
    """Generates CLML API source for one CLML sub module out of the global TVM module"""

    def __init__(self, cmod):
        """Initialize

        Parameters
        ----------
        cmod : Module
            The CLML sub module from TVM module
        """
        self.cmod = cmod
        self.codegen = None  # parsed JSON graph of the sub module
        self.nodes = None  # node list from the JSON graph
        self.node_map = {}  # node index -> generated C++ tensor variable name
        self.input_meta = []
        self.output_meta = []
        self.clml_code = []  # accumulated lines of generated C++ source
        self.sub_module_name = None

        # The templates below are emitted verbatim (after substitution) into
        # clml_models.cc, so their exact contents are part of the generated
        # program and must not be altered casually.
        self.MakeCLMLTensor = Template(
            """auto $name = runner.MakeCLMLTensor
        (std::vector<size_t>({$shape}), "$dtype", $layout);"""
        )
        self.MapInsert = Template("""runner.storage_map.insert({"$nid", $tensor_desc});""")
        self.MakeConv2D = Template(
            """
        // Convolution / Depthwise Convolution
        runner.MakeConv2D($input_tensor,
           $weight_tensor,
           $bias_tensor,
           $output_tensor,
           std::vector<cl_uint>({$padding}),
           std::vector<cl_uint>({$dilation}),
           std::vector<cl_uint>({$strides}),
           $groups,
           $mode,
           $activation,
           $has_bias,
           $has_act,
           "$dtype");"""
        )
        self.MakeConv2DWithBN = Template(
            """
        // Batchnorm
        runner.MakeConv2DWithBN($input_tensor,
                 $weight_tensor,
                 $bias_tensor,
                 $output_tensor,
                 $bn_scale_tensor,
                 $bn_bias_tensor,
                 $bn_mean_tensor,
                 $bn_var_tensor,
                 std::vector<float>  ({$bn_attrs}),
                 std::vector<cl_uint> ({$padding}),
                 std::vector<cl_uint> ({$dilation}),
                 std::vector<cl_uint> ({$strides}),
                 $groups,
                 $mode,
                 $activation,
                 $has_bias,
                 $has_act,
                 "$dtype");"""
        )
        self.MakeRelu = Template(
            """
        // Relu / Relu6
        runner.MakeRelu($input_tensor, $output_tensor, $relu_type, "$dtype");
        """
        )
        self.MakeBN = Template(
            """
        // Batchnorm
        runner.MakeBatchNorm($input_tensor,
              $output_tensor,
              $bn_scale_tensor,
              $bn_bias_tensor,
              $bn_mean_tensor,
              $bn_var_tensor,
              std::vector<float> ({$bn_attrs}), "$dtype");"""
        )
        self.MakePool2D = Template(
            """
        // Pool2D
        runner.MakePool2D($input_tensor,
           $output_tensor,
           std::vector<cl_uint> ({$pool_size}),
           std::vector<cl_uint> ({$strides}),
           std::vector<cl_uint> ({$padding}),
           "$pool_type", "$dtype");"""
        )
        self.MakeGlobalPool2D = Template(
            """
        // GlobalPool2D
        runner.MakeGlobalPool2D($input_tensor,
                 $output_tensor,
                 std::vector<cl_uint> ({$in_shape}),
                 "$pool_type", "$dtype");"""
        )
        self.MakeReshape = Template(
            """
        // Reshape
        runner.MakeReshape($input_tensor,
            $output_tensor, "$dtype");"""
        )
        self.MakeConcatenate = Template(
            """
        // Concatinate
        runner.MakeConcatenate(
                std::vector<std::shared_ptr<cl_ml_tensor_memory_desc_qcom>> ({$in_list}),
                $output_tensor,
                $axis, "$dtype");"""
        )
        self.MakeDense = Template(
            """
        // Dense
        runner.MakeDense($input_tensor,
          $weight_tensor,
          $output_tensor,
          $bias_tensor, "$dtype");"""
        )
        self.MakeSoftMax = Template(
            """
        // Softmax
        runner.MakeSoftMax($input_tensor,
            $output_tensor, "$dtype");"""
        )
        self.MakePad = Template(
            """
        // Pad
        runner.MakePad($input_tensor,
        $output_tensor,
        "$pad_mode",
        std::vector<cl_uint> ({$padding}), "$dtype");"""
        )
        self.MakeBatchFlatten = Template(
            """
        // BatchFlatten
        runner.MakeBatchFlatten($input_tensor,
                 $output_tensor, "$dtype");"""
        )
        self.MakeClip = Template(
            """
        // Clip
        runner.MakeClip($input_tensor,
         $output_tensor,
         $a_max,
         $a_min,
         "$dtype");"""
        )
        self.MakeBinaryOp = Template(
            """
        // BinaryOp
        runner.MakeBinaryOp($input_a,
             $input_b,
             $output_tensor,
             "$op", "$dtype");"""
        )

        self.MakeHeader = Template(
            """
        CLMLRunner $module(std::string name,
                   ToolArgs& args,
                   cl_platform_id arg_platform_id,
                   cl_context arg_context,
                   cl_device_id arg_device_id,
                   cl_command_queue arg_queue) {
        CLMLRunner runner = CLMLRunner(name,
                                 args,
                                 arg_platform_id,
                                 arg_context,
                                 arg_device_id,
                                 arg_queue);
        runner.MakeUnusedTensor();
        """
        )

        self.MakeFooter = Template(
            """
            return runner;
        }
        """
        )

        self.MakeMetaInfo = Template(
            "runner.SetMetaInfo("
            '"Subgraph Name: $name\\n    Input Count  : $input_count\\n'
            "    Output Count : $output_count\\n"
            '    Input MetaInfo\\n$input_meta\\n    Output MetaInfo\\n$output_meta");'
        )

        self.MakeInputMetaInfo = Template(
            "        Input: $in_name\\n            Dtype : $dtype\\n            Shape : [$shape]"
        )

        self.MakeOutputMetaInfo = Template(
            "        Output: $out_name\\n            Dtype : $dtype\\n            Shape : [$shape]"
        )

    def get_src(self):
        """Returns pair of sub module name and the generated source"""

        self.codegen = json.loads(self.cmod.get_source("json"))
        self.sub_module_name = self.codegen["symbol"]
        self.nodes = self.codegen["nodes"]
        self.clml_code.append(self.MakeHeader.substitute(module=self.sub_module_name))

        def get_tensor_from_map(
            node_seq, shape=None, layout="CL_TENSOR_LAYOUT_OPTIMAL_QCOM", dtype="float32"
        ):
            # Return the tensor variable name for a node, emitting the tensor
            # declaration (and const registration) the first time it is seen.
            if node_seq in self.node_map:
                return self.node_map[node_seq]
            node = self.nodes[node_seq]
            # dtype always comes from the node itself; the parameter is a fallback only.
            dtype = str(node["attrs"]["dtype"][0][0])
            if shape is None:
                shape = str(tuple(node["attrs"]["shape"][0][0]))[1:-1]

            self.clml_code.append(
                self.MakeCLMLTensor.substitute(
                    name=node["name"], shape=shape, dtype=dtype, layout=layout
                )
            )
            self.clml_code.append(
                self.MapInsert.substitute(nid=node["name"], tensor_desc=node["name"])
            )
            if node["op"] == "const":
                self.clml_code.append(
                    Template('runner.consts.push_back("$nid");').substitute(nid=node["name"])
                )
            self.node_map[node_seq] = node["name"]
            return node["name"]

        def make_output_tensor(
            node, node_seq, shape=None, layout="CL_TENSOR_LAYOUT_OPTIMAL_QCOM", dtype=None
        ):
            # Declare the output tensor for a kernel node and return its name.
            # Bug fix: the dtype default was "float32", which made the
            # `dtype is None` fallback dead code and ignored the node's dtype.
            if dtype is None:
                dtype = str(node["attrs"]["dtype"][0][0])
            if shape is None:
                shape = str(tuple(node["attrs"]["shape"][0][0]))[1:-1]
            node_out_name = self.sub_module_name + "_" + "layer_out_" + str(node_seq)
            self.clml_code.append(
                self.MakeCLMLTensor.substitute(
                    name=node_out_name, shape=shape, dtype=dtype, layout=layout
                )
            )
            return node_out_name

        def get_bn_param_shape(bn_node, axis):
            # BN params are 1-D; place their extent on `axis` of a NCHW shape.
            # Bug fix: the original assigned the whole shape *list* into one
            # position, producing e.g. "(1, [64], 1, 1)" in the generated C++.
            bn_shape = [1, 1, 1, 1]
            bn_shape[int(axis)] = bn_node["attrs"]["shape"][0][0][0]
            return str(tuple(bn_shape))[1:-1]

        for node_seq, node in enumerate(self.nodes):
            if node["op"] == "input":
                self.clml_code.append("// Input Node")
                dtype = str(node["attrs"]["dtype"][0][0])
                shape = str(tuple(node["attrs"]["shape"][0][0]))[1:-1]
                node_out_name = self.sub_module_name + "_" + "input_" + str(node_seq)
                self.clml_code.append(
                    self.MakeCLMLTensor.substitute(
                        name=node_out_name,
                        shape=shape,
                        dtype=dtype,
                        layout="CL_TENSOR_LAYOUT_OPTIMAL_QCOM",
                    )
                )
                self.clml_code.append(
                    self.MapInsert.substitute(nid=node_out_name, tensor_desc=node_out_name)
                )
                self.clml_code.append(
                    Template("runner.inputs.push_back($clml_input);").substitute(
                        clml_input=node_out_name
                    )
                )
                self.node_map[node_seq] = node_out_name
                self.input_meta.append(
                    self.MakeInputMetaInfo.substitute(
                        in_name=node_out_name, dtype=dtype, shape=shape
                    )
                )
            elif node["op"] == "kernel":
                self.clml_code.append("// Kernel Node : " + node["name"])
                if node["name"] == "nn.conv2d" or node["name"] == "nn.depthwise_conv2d":
                    if "padding" in node["attrs"]:
                        padding = str(tuple(int(x) for x in node["attrs"]["padding"][0]))[1:-1]
                    else:
                        padding = "0, 0, 0, 0"
                    dilation = str(tuple(int(x) for x in node["attrs"]["dilation"][0]))[1:-1]
                    strides = str(tuple(int(x) for x in node["attrs"]["strides"][0]))[1:-1]
                    groups = node["attrs"]["groups"][0][0]
                    if node["name"] == "nn.conv2d":
                        mode = "CL_CONVOLUTION_MODE_CONVOLUTION_QCOM"
                    else:
                        mode = "CL_CONVOLUTION_MODE_DEPTHWISE_QCOM"
                    activation = "CL_ACTIVATION_RELU"
                    has_act = False
                    if "activation_type" in node["attrs"]:
                        has_act = True
                        activation = node["attrs"]["activation_type"][0][0]
                        if activation == "relu":
                            activation = "CL_ACTIVATION_RELU"
                        elif activation == "relu6":
                            activation = "CL_ACTIVATION_RELU6"
                        else:
                            # Bug fix: the exception was constructed but never raised.
                            raise RuntimeError("Unknown activation:" + activation)
                    # Bug fix: the original compared the inputs *list* to an int
                    # (node["inputs"] == 3), which is always False, so the bias
                    # and batchnorm fusion paths were never generated.
                    # Fused conv input counts: 2 plain, 3 +bias, 6 +bn, 7 +bias+bn.
                    has_bias = len(node["inputs"]) in (3, 7)
                    has_bn = len(node["inputs"]) in (6, 7)
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    weight_tensor = get_tensor_from_map(node["inputs"][1][0])
                    if not has_bias:
                        bias_tensor = "runner.unusedTensor"
                    else:
                        bias_tensor = get_tensor_from_map(node["inputs"][2][0])

                    node_out_name = make_output_tensor(node, node_seq)

                    if not has_bn:
                        self.clml_code.append(
                            self.MakeConv2D.substitute(
                                input_tensor=input_tensor,
                                weight_tensor=weight_tensor,
                                bias_tensor=bias_tensor,
                                output_tensor=node_out_name,
                                padding=padding,
                                dilation=dilation,
                                strides=strides,
                                groups=groups,
                                mode=mode,
                                activation=activation,
                                has_bias="true" if has_bias else "false",
                                has_act="true" if has_act else "false",
                                dtype=node["attrs"]["dtype"][0][0],
                            )
                        )
                    else:
                        bn_index = 3 if has_bias else 2
                        bn_attrs = tuple(node["attrs"]["batchnorm"][0][0])
                        axis = bn_attrs[0]
                        bn_node = self.nodes[node["inputs"][bn_index][0]]
                        bn_shape_str = get_bn_param_shape(bn_node, axis)

                        # BN scale/bias/mean/variance are consecutive inputs
                        # starting at bn_index; dtype is taken from each node.
                        bn_scale_tensor = get_tensor_from_map(
                            node["inputs"][bn_index][0], shape=bn_shape_str
                        )
                        bn_bias_tensor = get_tensor_from_map(
                            node["inputs"][bn_index + 1][0], shape=bn_shape_str
                        )
                        bn_mean_tensor = get_tensor_from_map(
                            node["inputs"][bn_index + 2][0], shape=bn_shape_str
                        )
                        bn_var_tensor = get_tensor_from_map(
                            node["inputs"][bn_index + 3][0], shape=bn_shape_str
                        )

                        self.clml_code.append(
                            self.MakeConv2DWithBN.substitute(
                                input_tensor=input_tensor,
                                weight_tensor=weight_tensor,
                                bias_tensor=bias_tensor,
                                output_tensor=node_out_name,
                                bn_scale_tensor=bn_scale_tensor,
                                bn_bias_tensor=bn_bias_tensor,
                                bn_mean_tensor=bn_mean_tensor,
                                bn_var_tensor=bn_var_tensor,
                                bn_attrs=str(bn_attrs)[1:-1],
                                padding=padding,
                                dilation=dilation,
                                strides=strides,
                                groups=groups,
                                mode=mode,
                                activation=activation,
                                has_bias="true" if has_bias else "false",
                                has_act="true" if has_act else "false",
                                dtype=node["attrs"]["dtype"][0][0],
                            )
                        )
                elif node["name"] == "nn.relu6" or node["name"] == "nn.relu":
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    relu_type = (
                        "CL_ACTIVATION_RELU" if node["name"] == "nn.relu" else "CL_ACTIVATION_RELU6"
                    )
                    self.clml_code.append(
                        self.MakeRelu.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            relu_type=relu_type,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "nn.batch_norm":
                    bn_attrs = tuple(node["attrs"]["batchnorm"][0][0])
                    axis = bn_attrs[0]
                    bn_node = self.nodes[node["inputs"][0][0]]
                    bn_shape_str = get_bn_param_shape(bn_node, axis)
                    bn_scale_tensor = get_tensor_from_map(node["inputs"][0][0], shape=bn_shape_str)
                    bn_bias_tensor = get_tensor_from_map(node["inputs"][1][0], shape=bn_shape_str)
                    bn_mean_tensor = get_tensor_from_map(node["inputs"][2][0], shape=bn_shape_str)
                    bn_var_tensor = get_tensor_from_map(node["inputs"][3][0], shape=bn_shape_str)

                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)

                    self.clml_code.append(
                        self.MakeBN.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            bn_scale_tensor=bn_scale_tensor,
                            bn_bias_tensor=bn_bias_tensor,
                            bn_mean_tensor=bn_mean_tensor,
                            bn_var_tensor=bn_var_tensor,
                            bn_attrs=str(bn_attrs)[1:-1],
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] in ["nn.max_pool2d", "nn.avg_pool2d", "nn.l2_pool2d"]:
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    pool_size = str(tuple(int(x) for x in node["attrs"]["pool_size"][0]))[1:-1]
                    strides = str(tuple(int(x) for x in node["attrs"]["strides"][0]))[1:-1]
                    padding = str(tuple(int(x) for x in node["attrs"]["padding"][0]))[1:-1]
                    self.clml_code.append(
                        self.MakePool2D.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            pool_size=pool_size,
                            strides=strides,
                            padding=padding,
                            pool_type=node["name"],
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] in ["nn.global_max_pool2d", "nn.global_avg_pool2d"]:
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    in_node = self.nodes[node["inputs"][0][0]]
                    in_shape = str(tuple(in_node["attrs"]["shape"][0][0]))[1:-1]
                    self.clml_code.append(
                        self.MakeGlobalPool2D.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            in_shape=in_shape,
                            pool_type=node["name"],
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "reshape":
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    self.clml_code.append(
                        self.MakeReshape.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "concatenate":
                    # Bug fix: str() of a Python list leaves quotes around the
                    # C++ identifiers; join the names directly instead.
                    in_list = ", ".join(
                        get_tensor_from_map(node["inputs"][x][0])
                        for x in range(len(node["inputs"]))
                    )
                    node_out_name = make_output_tensor(node, node_seq)
                    axis = node["attrs"]["axis"][0][0]
                    self.clml_code.append(
                        self.MakeConcatenate.substitute(
                            in_list=in_list,
                            output_tensor=node_out_name,
                            axis=axis,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "nn.dense":
                    in_node = self.nodes[node["inputs"][0][0]]
                    in_shape = tuple(in_node["attrs"]["shape"][0][0])
                    # Bug fix: weight shape must come from the weight node; the
                    # original read the input node's shape twice.
                    wt_node = self.nodes[node["inputs"][1][0]]
                    wt_shape = tuple(wt_node["attrs"]["shape"][0][0])
                    input_tensor = get_tensor_from_map(
                        node["inputs"][0][0], shape=str(tuple([1, in_shape[1], 1, 1]))[1:-1]
                    )
                    weight_tensor = get_tensor_from_map(
                        node["inputs"][1][0],
                        shape=str(tuple([wt_shape[0], wt_shape[1], 1, 1]))[1:-1],
                    )
                    # Bug fix: three inputs (data, weight, bias) means a bias IS
                    # present; the original condition was inverted and indexed
                    # inputs[2] for the bias-less two-input case.
                    if len(node["inputs"]) == 3:
                        bias_tensor = get_tensor_from_map(node["inputs"][2][0])
                    else:
                        bias_tensor = "runner.unusedTensor"

                    node_out_name = make_output_tensor(
                        node, node_seq, shape=str(tuple([1, wt_shape[0], 1, 1]))[1:-1]
                    )
                    self.clml_code.append(
                        self.MakeDense.substitute(
                            input_tensor=input_tensor,
                            weight_tensor=weight_tensor,
                            output_tensor=node_out_name,
                            bias_tensor=bias_tensor,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "nn.softmax":
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    self.clml_code.append(
                        self.MakeSoftMax.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "nn.pad":
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    pad_mode = node["attrs"]["pad_mode"][0][0]
                    padding = str(tuple(int(x) for x in node["attrs"]["pad_width"][0]))[1:-1]
                    self.clml_code.append(
                        self.MakePad.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            pad_mode=pad_mode,
                            padding=padding,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "nn.batch_flatten":
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    self.clml_code.append(
                        self.MakeBatchFlatten.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] == "clip":
                    input_tensor = get_tensor_from_map(node["inputs"][0][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    a_max = node["attrs"]["a_max"][0][0]
                    a_min = node["attrs"]["a_min"][0][0]
                    self.clml_code.append(
                        self.MakeClip.substitute(
                            input_tensor=input_tensor,
                            output_tensor=node_out_name,
                            a_max=a_max,
                            a_min=a_min,
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                elif node["name"] in [
                    "add",
                    "subtract",
                    "multiply",
                    "minimum",
                    "maximum",
                    "divide",
                ]:
                    input_a = get_tensor_from_map(node["inputs"][0][0])
                    input_b = get_tensor_from_map(node["inputs"][1][0])
                    node_out_name = make_output_tensor(node, node_seq)
                    self.clml_code.append(
                        self.MakeBinaryOp.substitute(
                            input_a=input_a,
                            input_b=input_b,
                            output_tensor=node_out_name,
                            op=node["name"],
                            dtype=node["attrs"]["dtype"][0][0],
                        )
                    )
                else:
                    # Bug fix: the exception was constructed but never raised,
                    # which then hit a NameError on node_out_name below.
                    raise RuntimeError("Unsupported Op:" + node["name"])
                self.clml_code.append(
                    self.MapInsert.substitute(nid=node_out_name, tensor_desc=node_out_name)
                )
                self.node_map[node_seq] = node_out_name

            elif node["op"] != "const":
                print("Unknown Node type:", node["op"])

        # Populate outputs
        out_nodes = self.codegen["heads"]
        self.clml_code.append("// Populate outputs")
        for nid_triple in out_nodes:
            nid = nid_triple[0]
            out_node = self.nodes[nid]
            dtype = str(out_node["attrs"]["dtype"][0][0])
            shape = str(tuple(out_node["attrs"]["shape"][0][0]))[1:-1]
            out_name = self.sub_module_name + "_" + "layer_out_" + str(nid)
            self.clml_code.append(
                Template(
                    'runner.outputs.insert({"$out_name", runner.storage_map["$out_name"]});'
                ).substitute(out_name=out_name)
            )
            self.clml_code.append(
                Template('runner.outputs_dtypes.insert({"$out_name", "$dtype"});').substitute(
                    out_name=out_name, dtype=dtype
                )
            )
            self.clml_code.append(
                Template(
                    "runner.outputs_shapes.insert" '({"$out_name", std::vector<size_t>({$shape})});'
                ).substitute(out_name=out_name, shape=shape)
            )
            self.output_meta.append(
                self.MakeOutputMetaInfo.substitute(out_name=out_name, dtype=dtype, shape=shape)
            )

        # Mem allocation & Param copy
        self.clml_code.append("// Allocate Tensor Memory and copy params")
        self.clml_code.append("runner.AllocateMemAndPopulateParams();")

        # Meta data preparation
        self.clml_code.append(
            self.MakeMetaInfo.substitute(
                name=self.sub_module_name,
                input_count=len(self.input_meta),
                output_count=len(self.output_meta),
                input_meta="\n".join(self.input_meta),
                output_meta="\n".join(self.output_meta),
            )
        )

        self.clml_code.append(self.MakeFooter.substitute())
        return (self.sub_module_name, self.clml_code)
+
+
class CLMLGenSrc:
    """Generates CLML API source given a TVM compiled mod"""

    def __init__(self, libm):
        """Initialize

        Parameters
        ----------
        libm : Module
            Compiled relay module
        """
        self.libm = libm
        # Accumulated generated source lines for the whole module.
        self.gen_src = []
        # List of imported "clml" sub modules, populated in get_artifacts.
        self.clml_modules = None
        # Maps sub-module name -> list of generated source lines.
        self.clml_builds = {}
        self.codegen = None
        self.nodes = None

        # File header emitted once at the top of the generated clml_models.cc.
        self.MakeFileHeader = Template(
            """/*
        * Licensed to the Apache Software Foundation (ASF) under one
        * or more contributor license agreements.  See the NOTICE file
        * distributed with this work for additional information
        * regarding copyright ownership.  The ASF licenses this file
        * to you under the Apache License, Version 2.0 (the
        * "License"); you may not use this file except in compliance
        * with the License.  You may obtain a copy of the License at
        *
        *   http://www.apache.org/licenses/LICENSE-2.0
        *
        * Unless required by applicable law or agreed to in writing,
        * software distributed under the License is distributed on an
        * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
        * KIND, either express or implied.  See the License for the
        * specific language governing permissions and limitations
        * under the License.
        */

        /*!
         * \\file clml_models.cc
         * \\brief CLML models for all subgraph in given TVM module.
         */

        // AUTO GENERATED BY TOOL (clml_codegen.py), PLEASE DO NOT CHANGE THIS FILE!
        // =========================================================================

        #include <iostream>
        #include <fstream>

        #include <vector>
        #include <string>
        #include <algorithm>
        #include <math.h>
        #include <list>

        // Project includes
        #include "CL/cl.h"
        #include "CL/cl_qcom_ml_ops.h"

        #include "clml_runner.h"

        using namespace tvm::runtime;
        """
        )

    def get_clml_params(self):
        """Returns parameters from the TVM module as a dict of numpy arrays.

        Collects constants from the top-level library (when it is a
        const_loader) and from any imported const_loader modules.
        """
        clml_params = {}
        if self.libm.get_lib().type_key == "const_loader":
            params = self.libm.get_lib().get_function("get_const_var_ndarray")()
            clml_params.update(params)

        for mod in self.libm.get_lib().imported_modules:
            if mod.type_key == "const_loader":
                # Use get_function, consistent with the top-level lookup above;
                # runtime.Module does not expose packed funcs as attributes.
                params = mod.get_function("get_const_var_ndarray")()
                clml_params.update(params)

        # Convert NDArray values to numpy so they can be serialized/saved.
        clml_params_save = {}
        for key, val in clml_params.items():
            clml_params_save[str(key)] = val.numpy()

        return clml_params_save

    def get_artifacts(self):
        """Function that returns params as dict and source as list of source code lines"""

        # Pick only the CLML byoc modules imported into the compiled library.
        self.clml_modules = list(
            filter(lambda mod: mod.type_key == "clml", self.libm.get_lib().imported_modules)
        )
        self.clml_builds["file_header"] = [self.MakeFileHeader.substitute()]

        # Generate one source section per CLML sub graph.
        for cmod in self.clml_modules:
            (sub_module_name, clml_code) = CLMLGetSubModuleSrc(cmod).get_src()
            self.clml_builds[sub_module_name] = clml_code

        # BuildModules() instantiates a CLMLRunner per generated sub module.
        main_code = []
        main_code.append(
            """
            std::vector<CLMLRunner> BuildModules(ToolArgs& args,
                                                 cl_platform_id arg_platform,
                                                 cl_context arg_context,
                                                 cl_device_id arg_device_id,
                                                 cl_command_queue arg_queue) {
                  std::vector<CLMLRunner> runners;"""
        )
        for key, val in self.clml_builds.items():
            if key != "file_header":
                main_code.append(
                    "runners.push_back("
                    + key
                    + '("'
                    + key
                    + '", args, arg_platform, arg_context, arg_device_id, arg_queue));'
                )
        main_code.append("return runners;}")
        self.clml_builds["MainBuild"] = main_code

        # Flatten all sections (header first, as inserted) into one source list.
        for key, val in self.clml_builds.items():
            self.gen_src.extend(val)

        return (self.get_clml_params(), self.gen_src)
diff --git a/src/relay/backend/contrib/codegen_json/codegen_json.h 
b/src/relay/backend/contrib/codegen_json/codegen_json.h
index c1cde2a03b..350a1275ae 100644
--- a/src/relay/backend/contrib/codegen_json/codegen_json.h
+++ b/src/relay/backend/contrib/codegen_json/codegen_json.h
@@ -340,6 +340,7 @@ class JSONSerializer : public 
MemoizedExprTranslator<std::vector<JSONGraphNodeEn
       node_row_ptr.push_back(num_entry);
     }
     writer->BeginObject();
+    writer->WriteObjectKeyValue("symbol", symbol_);
     writer->WriteObjectKeyValue("nodes", nodes_);
     writer->WriteObjectKeyValue("arg_nodes", arg_nodes);
     writer->WriteObjectKeyValue("heads", heads_);
diff --git a/src/runtime/const_loader_module.cc 
b/src/runtime/const_loader_module.cc
index a8028e616c..f57c7d11d5 100644
--- a/src/runtime/const_loader_module.cc
+++ b/src/runtime/const_loader_module.cc
@@ -77,6 +77,16 @@ class ConstLoaderModuleNode : public ModuleNode {
       initialized_[name] = true;
     }
 
+    if (name == "get_const_var_ndarray") {
+      return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {
+        Map<String, ObjectRef> ret_map;
+        for (const auto& kv : const_var_ndarray_) {
+          ret_map.Set(kv.first, kv.second);
+        }
+        *rv = ret_map;
+      });
+    }
+
     // Run the module.
     // Normally we would only have a limited number of submodules. The runtime
     // symobl lookup overhead should be minimal.
diff --git a/src/runtime/contrib/json/json_runtime.h 
b/src/runtime/contrib/json/json_runtime.h
index 3a02202b87..c84e659c6b 100644
--- a/src/runtime/contrib/json/json_runtime.h
+++ b/src/runtime/contrib/json/json_runtime.h
@@ -228,6 +228,7 @@ class JSONRuntimeBase : public ModuleNode {
   void Load(dmlc::JSONReader* reader) {
     reader->BeginObject();
     std::string key;
+    std::string symbol_;
     while (reader->NextObjectItem(&key)) {
       if (key == "nodes") {
         reader->Read(&nodes_);
@@ -237,6 +238,8 @@ class JSONRuntimeBase : public ModuleNode {
         reader->Read(&node_row_ptr_);
       } else if (key == "heads") {
         reader->Read(&outputs_);
+      } else if (key == "symbol") {
+        reader->Read(&symbol_);
       } else {
         LOG(FATAL) << "Unknown key: " << key;
       }

Reply via email to