[GitHub] [incubator-tvm] zhanghaohit commented on a change in pull request #6126: [VTA][OpenCL] intelfocl

2020-08-26 Thread GitBox


zhanghaohit commented on a change in pull request #6126:
URL: https://github.com/apache/incubator-tvm/pull/6126#discussion_r478102076



##
File path: vta/runtime/runtime.cc
##
@@ -329,7 +442,7 @@ class BaseQueue {
   // End location of current SRAM write in FIFO mode
   uint32_t sram_end_{0};
   // The buffer in DRAM
-  std::vector dram_buffer_;
+  std::vector> dram_buffer_;

Review comment:
   done









[GitHub] [incubator-tvm] zhanghaohit commented on a change in pull request #6126: [VTA][OpenCL] intelfocl

2020-07-29 Thread GitBox


zhanghaohit commented on a change in pull request #6126:
URL: https://github.com/apache/incubator-tvm/pull/6126#discussion_r462007395



##
File path: src/tir/transforms/lower_tvm_builtin.cc
##
@@ -88,16 +88,6 @@ class BuiltinLower : public StmtExprMutator {
     op = stmt.as<AllocateNode>();
     // Get constant allocation bound.
     int64_t nbytes = GetVectorBytes(op->dtype);

Review comment:
   The original code fails if there are multiple workloads in one schedule. For example, in `fused_nn_conv2d_add_add_right_shift_clip_cast_31`, both `conv2d` and `add` may carry a `workload` attr, so we have to pick the correct workload by comparing the `task_name`.
   
   Previously this worked fine, since `add` was not a tunable op. But now we also want to offload ALU-only nodes (e.g. residual blocks) such as `fused_cast_cast_add_nn_relu_clip_cast_3` to VTA, so we create a VTA schedule for `add` (see [add.alu](https://github.com/apache/incubator-tvm/blob/a1daa1c47d2a19e51ab96da5c20e187f3bbf3413/vta/python/vta/top/op.py#L166)).

##
File path: python/tvm/autotvm/task/topi_integration.py
##
@@ -227,17 +227,21 @@ def wrapper(outs, *args, **kwargs):
     return _decorate
 
 
-def get_workload(outs):
+def get_workload(outs, task_name=None):
     """Retrieve the workload from outputs"""
     def traverse(tensors):
         """traverse all ops to find attached workload"""
         for t in tensors:
             op = t.op
-            if 'workload' in op.attrs:
-                return args_to_workload(op.attrs['workload'])
             wkl = traverse(op.input_tensors)
             if wkl:
                 return wkl
+

Review comment:
   The original code fails if there are multiple workloads in one schedule. For example, in `fused_nn_conv2d_add_add_right_shift_clip_cast_31`, both `conv2d` and `add` may carry a `workload` attr, so we have to pick the correct workload by comparing the `task_name`.
   
   Previously this worked fine, since `add` was not a tunable op. But now we also want to offload ALU-only nodes (e.g. residual blocks) such as `fused_cast_cast_add_nn_relu_clip_cast_3` to VTA, so we create a VTA schedule for `add` (see [add.alu](https://github.com/apache/incubator-tvm/blob/a1daa1c47d2a19e51ab96da5c20e187f3bbf3413/vta/python/vta/top/op.py#L166)).
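   To make the matching rule concrete, here is a minimal, self-contained sketch of the idea. `FakeOp`, `FakeTensor`, and the task names are made up for illustration; the real function lives in `python/tvm/autotvm/task/topi_integration.py` and works on TVM tensors via `args_to_workload`, with the workload tuple's first element being the task name.
   
   ```python
   # Hypothetical stand-ins for TVM ops/tensors, just to show the lookup rule.
   class FakeOp:
       def __init__(self, attrs=None, inputs=()):
           self.attrs = attrs or {}
           self.input_tensors = list(inputs)
   
   class FakeTensor:
       def __init__(self, op):
           self.op = op
   
   def get_workload(outs, task_name=None):
       """Return the first 'workload' attr whose task name matches task_name
       (or the first workload found at all if task_name is None)."""
       def traverse(tensors):
           for t in tensors:
               wkl = traverse(t.op.input_tensors)
               if wkl:
                   return wkl
               if "workload" in t.op.attrs:
                   wkl = t.op.attrs["workload"]      # wkl[0] is the task name
                   if task_name is None or wkl[0] == task_name:
                       return wkl
           return None
       return traverse(outs)
   
   # Both conv2d and add carry a workload attr inside one fused function:
   conv = FakeOp(attrs={"workload": ("conv2d_packed.vta", "conv-args")})
   add = FakeOp(attrs={"workload": ("add.alu", "add-args")}, inputs=[FakeTensor(conv)])
   outs = [FakeTensor(add)]
   
   print(get_workload(outs))                        # picks conv2d's workload
   print(get_workload(outs, task_name="add.alu"))   # picks the matching one
   ```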

##
File path: src/relay/backend/compile_engine.cc
##
@@ -230,7 +230,7 @@ class ScheduleGetter : public backend::MemoizedExprTranslator<Array<te::Tensor>>
   << "Two complicated op in a primitive function "
   << " master=" << master_op_ << " current=" << op;
 }
-if (op_pattern >= master_op_pattern_) {
+if (op_pattern > master_op_pattern_) {

Review comment:
   With this change, among ops that share the same op_pattern, the op visited first gets priority to become the master_op. I think it is reasonable that the earlier ops are the more important ones.
   
   This change is also motivated by the introduction of ALU-only nodes, e.g. `fused_cast_cast_add_left_shift_add_nn_relu_add_right_shift_cast_2`. In this case we want to choose the first `add` as the master_op. The last `add` is actually an addition of a tensor and a constant; if we chose it as the master_op, the autotune config could overflow the ALU buffer, since the whole fused op actually needs more ALU buffer.
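   To make the tie-breaking behaviour concrete, here is a minimal, self-contained sketch of the selection rule (op names and pattern values are made up; the real logic lives in `ScheduleGetter` in `src/relay/backend/compile_engine.cc`).
   
   ```python
   # Strict '>' keeps the first op among equal patterns as master_op;
   # the old '>=' lets the last one win.
   def pick_master(ops, strict=True):
       master, master_pattern = None, -1
       for name, pattern in ops:
           better = pattern > master_pattern if strict else pattern >= master_pattern
           if better:
               master, master_pattern = name, pattern
       return master
   
   # An ALU-only fused function with two 'add' ops of the same pattern:
   fused = [("cast", 1), ("add_tensor_tensor", 2), ("add_tensor_const", 2)]
   print(pick_master(fused, strict=True))   # add_tensor_tensor (visited first)
   print(pick_master(fused, strict=False))  # add_tensor_const  (visited last)
   ```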

##
File path: src/relay/backend/compile_engine.cc
##
@@ -288,7 +288,7 @@ class ScheduleGetter : public backend::MemoizedExprTranslator<Array<te::Tensor>>
   tvm::Target target_;
   Op master_op_;
   Attrs master_attrs_;
-  int master_op_pattern_{0};
+  int master_op_pattern_{-1};

Review comment:
   This is needed because of the change above from
   ```
   if (op_pattern >= master_op_pattern_) {
   ```
   to
   ```
   if (op_pattern > master_op_pattern_) {
   ```
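   A tiny follow-up illustration of why the sentinel has to move to -1 once the comparison is strict (op names and pattern values are again made up).
   
   ```python
   # With the old default of 0 and a strict '>', a fused function made only of
   # pattern-0 ops would never select any master op at all.
   ops = [("cast", 0), ("clip", 0)]
   
   for init in (0, -1):
       master, master_pattern = None, init
       for name, pattern in ops:
           if pattern > master_pattern:      # strict comparison from the patch
               master, master_pattern = name, pattern
       print("init =", init, "-> master =", master)
   # init = 0  -> master = None
   # init = -1 -> master = cast
   ```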

##
File path: src/tir/transforms/lower_tvm_builtin.cc
##
@@ -88,16 +88,6 @@ class BuiltinLower : public StmtExprMutator {
     op = stmt.as<AllocateNode>();
     // Get constant allocation bound.
     int64_t nbytes = GetVectorBytes(op->dtype);

Review comment:
   This removes the special handling for kDLCPU. Otherwise, it may cause an LLVM parameter mismatch error:
   
   ```bash
   Traceback (most recent call last):
     File "vta/tutorials/frontend/deploy_classification.py", line 210, in 
       params=params, target_host=env.target_host)
     File "/4pd/home/zhanghao/workspace/tvm-2/tvm/python/tvm/relay/build_module.py", line 251, in build
       graph_json, mod, params = bld_mod.build(mod, target, target_host, params)
     File "/4pd/home/zhanghao/workspace/tvm-2/tvm/python/tvm/relay/build_module.py", line 120, in build
       self._build(mod, target, target_host)
     File "tvm/_ffi/_cython/./packed_func.pxi", line 321, in tvm._ffi._cy3.core.PackedFuncBase.__call__
     File "tvm/_ffi/_cython/./packed_func.pxi", line 256, in tvm._ffi._cy3.core.FuncCall
     File "tvm/_ffi/_cython/./packed_func.pxi", line 245, in tvm._ffi._cy3.core.FuncCall3
     File "tvm/_ffi/_cython/./base.pxi", line