kk2049 opened a new issue #9715:
URL: https://github.com/apache/tvm/issues/9715


   ### My problem
   I am trying to use the auto-scheduler to generate CUDA source code for the backward stage of an NCHW Winograd conv2d. Because of some bugs in `topi.cuda.conv2d_winograd.winograd_cuda`, I copied parts of its code to build my own workload.
   
   Without `te.gradient`, this workload works and the auto-scheduler successfully generates source code for the forward stage. As soon as I add `te.gradient`, the workload no longer works and I get the following error message: `Check failed: (!repl_op.same_as(s->op)) is false: Cannot find Tensor(shape=[4, 2], op.name=A) in the inputs of compute(extracted_tensor.d.shared, ......`
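
For context, the `Tensor(shape=[4, 2], op.name=A)` in the message appears to correspond to the Winograd inverse-transform matrix `A` used in the script under "Steps to reproduce", where it is indexed as `A[r_a_2][vh]` and so has shape `(alpha, m)`. The sketch below only mirrors the shape arithmetic from that script in plain Python. `winograd_shapes` is a hypothetical helper name introduced for illustration, not a TVM function, and it assumes symmetric padding.

```python
def winograd_shapes(N, H, W, KH, KW, pad=1, stride=1):
    """Mirror the shape arithmetic of the repro script (hypothetical helper, no TVM needed)."""
    m = 4 if H % 8 == 0 else 2           # tile size, same rule as _infer_tile_size
    r = KW                               # kernel size
    alpha = m + r - 1                    # transformed tile size
    H_out = (H + 2 * pad - KH) // stride + 1
    W_out = (W + 2 * pad - KW) // stride + 1
    nH, nW = (H_out + m - 1) // m, (W_out + m - 1) // m
    P = N * nH * nW                      # number of tiles
    return {
        "m": m,
        "alpha": alpha,
        "P": P,
        "A": (alpha, m),                 # inverse-transform matrix
        "B": (alpha, alpha),             # data-transform matrix
        "G": (alpha, r),                 # kernel-transform matrix
    }


# The configuration used in this issue: N=1, H=W=7, KH=KW=3, padding 1, stride 1.
shapes = winograd_shapes(1, 7, 7, 3, 3)
print(shapes["A"])  # (4, 2), matching the shape in the error message
```

With these settings, `m = 2` and `alpha = 4`, so the `[4, 2]` tensor named `A` is consistent with the inverse-transform matrix rather than with any tensor defined directly in the script.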
   
   I am really confused now. The fact that forward-stage codegen works suggests that my workload is at least structurally correct, so I suspect this is caused by a bug in TVM, but I am not sure.
   
   Could someone help me find out whether this is a bug in TVM?
   
   Thanks a lot!!!
   
   ### Expected behavior
   
   The auto-scheduler should find a valid schedule for this workload.
   ### Actual behavior
   
   I get the error below as soon as tuning starts.
   ```
   Get devices for measurement successfully!
   ----------------------------------------------------------------------
   ------------------------------  [ Search ]
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "bug_scheduler.py", line 189, in <module>
       task.tune(tune_option)
     File "/data/anaconda3/envs/env3.7/lib/python3.7/site-packages/tvm-0.8.0-py3.7-linux-x86_64.egg/tvm/auto_scheduler/search_task.py", line 498, in tune
       _ffi_api.AutoSchedule(search_policy, tuning_options)
     File "/data/anaconda3/envs/env3.7/lib/python3.7/site-packages/tvm-0.8.0-py3.7-linux-x86_64.egg/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
       raise get_last_ffi_error()
   tvm._ffi.base.TVMError: Traceback (most recent call last):
     13: TVMFuncCall
     12: std::_Function_handler<void (tvm::runtime::TVMArgs, 
tvm::runtime::TVMRetValue*), 
tvm::runtime::TypedPackedFunc<tvm::runtime::Array<tvm::runtime::ObjectRef, 
void> (tvm::auto_scheduler::SearchPolicy, 
tvm::auto_scheduler::TuningOptions)>::AssignTypedLambda<tvm::auto_scheduler::{lambda(tvm::auto_scheduler::SearchPolicy,
 
tvm::auto_scheduler::TuningOptions)#3}>(tvm::auto_scheduler::{lambda(tvm::auto_scheduler::SearchPolicy,
 tvm::auto_scheduler::TuningOptions)#3}, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs 
const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, 
tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
     11: tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, 
tvm::auto_scheduler::TuningOptions)
     10: tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, 
tvm::auto_scheduler::ProgramMeasurer)
     9: tvm::auto_scheduler::SketchPolicyNode::SearchOneRound(int, 
tvm::runtime::Array<tvm::auto_scheduler::State, void>*)
     8: tvm::auto_scheduler::SketchPolicyNode::GenerateSketches()
     7: 
tvm::auto_scheduler::RuleAddCacheRead::Apply(tvm::auto_scheduler::SketchPolicyNode
 const&, tvm::auto_scheduler::State const&, int) const
     6: tvm::auto_scheduler::State::cache_read(int, tvm::runtime::String 
const&, tvm::runtime::Array<tvm::Integer, void> const&, 
tvm::auto_scheduler::ComputeDAG const&)
     5: 
tvm::auto_scheduler::CacheReadStepNode::ApplyToState(tvm::auto_scheduler::State*,
 tvm::auto_scheduler::ComputeDAG const&) const
     4: 
tvm::auto_scheduler::ComputeDAG::ReplayAndGetDAG(tvm::runtime::Array<tvm::auto_scheduler::Step,
 void> const&) const
     3: 
tvm::auto_scheduler::ComputeDAG::ApplySteps(tvm::runtime::Array<tvm::auto_scheduler::Step,
 void> const&, tvm::runtime::Array<tvm::te::Stage, void>*, 
tvm::runtime::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, 
tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, 
tvm::auto_scheduler::LayoutRewriteOption) const
     2: tvm::auto_scheduler::StepApplyToSchedule(tvm::auto_scheduler::Step 
const&, tvm::runtime::Array<tvm::te::Stage, void>*, 
tvm::runtime::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, 
tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, tvm::te::Schedule*, 
tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&)
     1: 
tvm::auto_scheduler::CacheReadStepNode::ApplyToSchedule(tvm::runtime::Array<tvm::te::Stage,
 void>*, tvm::runtime::Map<tvm::te::Stage, 
tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, 
tvm::runtime::ObjectEqual>*, tvm::te::Schedule*) const
     0: tvm::te::Schedule::cache_read(tvm::te::Tensor const&, 
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > 
const&, tvm::runtime::Array<tvm::te::Operation, void> const&)
     File "/data/apache-tvm-src-v0.8.0.rc0/src/te/schedule/schedule_dataflow_rewrite.cc", line 168
   TVMError: 
   ---------------------------------------------------------------
   An error occurred during the execution of TVM.
   For more information, please see: https://tvm.apache.org/docs/errors.html
   ---------------------------------------------------------------
     Check failed: (!repl_op.same_as(s->op)) is false: Cannot find Tensor(shape=[4, 2], op.name=A) in the inputs of compute(extracted_tensor.d.shared, body=[extracted_tensor[ax0, ax1, ax2, ax3]], axis=[iter_var(ax0, range(min=0, ext=2)), iter_var(ax1, range(min=0, ext=2)), iter_var(ax2, range(min=0, ext=4)), iter_var(ax3, range(min=0, ext=4))], reduce_axis=[], tag=, attrs={})
   
   ```
   ### Environment
   
   OS: Ubuntu 16.04
   CUDA version: 10.2
   TVM version: 0.8.0, built from the source code on the "Download Apache TVM Source Code" web page.
   
   ### Steps to reproduce
   
   I apologize for posting such a long script. I tried to cut it down to a smaller reproduction, but the error is only triggered by the full code below.
   ``` python
   import os
   
   import numpy as np
   import tvm
   from tvm import auto_scheduler
   
   import logging
   from tvm import te, topi
   from tvm import autotvm
   
   from tvm.topi import nn
   from tvm.topi.utils import get_const_int, get_const_tuple, traverse_inline
   from tvm.topi.nn.winograd_util import winograd_transform_matrices
   from tvm.topi.nn.conv2d import conv2d_winograd_nhwc, _conv2d_winograd_nhwc_impl
   import sys
   import numpy as np
   from tvm.topi.testing import conv2d_nchw_python
   
   def _infer_tile_size(data, kernel, layout="NCHW"):
       if layout == "NCHW":
           N, CI, H, W = get_const_tuple(data.shape)
       else:
           assert layout == "NHWC"
           N, H, W, CI = get_const_tuple(data.shape)
   
       if H % 8 == 0:
           return 4
       return 2
   
   @auto_scheduler.register_workload
   def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
       data = te.placeholder((N, CI, H, W), name="data")
       kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
   
       stride = (1,1)
       padding = (1,1)
       dilation = (1,1)
       pre_computed = False
       out_dtype = "float32"
   
       tile_size = _infer_tile_size(data, kernel)
       N, CI, H, W = get_const_tuple(data.shape)
   
       if isinstance(N, tvm.tir.Any):
           N = tvm.te.size_var("n")
   
       if not isinstance(H, int) or not isinstance(W, int):
           raise RuntimeError(
               "cuda winograd conv2d doesn't support dynamic input\
                              height or width."
           )
   
       if isinstance(dilation, int):
           dilation_h = dilation_w = dilation
       else:
           dilation_h, dilation_w = dilation
       HSTR, WSTR = (stride, stride) if isinstance(stride, int) else stride
   
       if not pre_computed:  # kernel tensor is raw tensor, do strict check
           if dilation_h != 1 or dilation_w != 1:
               kernel = nn.dilate(kernel, (1, 1, dilation_h, dilation_w))
           CO, CI, KH, KW = get_const_tuple(kernel.shape)
           alpha = KW + tile_size - 1
           assert HSTR == 1 and WSTR == 1 and KH == KW
       else:
           # kernel tensor is pre-transformed. this op is created by alter op layout.
           # dilation is not supported
           alpha, _, CI, CO = get_const_tuple(kernel.shape)
           KH = KW = alpha + 1 - tile_size
           assert HSTR == 1 and WSTR == 1 and dilation_h == 1 and dilation_w == 1
   
       pt, pl, pb, pr = nn.get_pad_tuple(padding, (KH, KW))
       data_pad = nn.pad(data, (0, 0, pt, pl), (0, 0, pb, pr), name="data_pad")
   
       r = KW
       m = tile_size
       A, B, G = winograd_transform_matrices(m, r, out_dtype)
   
       H = (H + pt + pb - KH) // HSTR + 1
       W = (W + pl + pr - KW) // WSTR + 1
       nH, nW = (H + m - 1) // m, (W + m - 1) // m
   
       P = N * nH * nW if isinstance(N, int) else nH * nW
   
       # transform kernel
       if not pre_computed:
           r_kh = te.reduce_axis((0, KH), name="r_kh")
           r_kw = te.reduce_axis((0, KW), name="r_kw")
           kernel_pack = te.compute(
               (alpha, alpha, CI, CO),
               lambda eps, nu, ci, co: te.sum(
                   kernel[co][ci][r_kh][r_kw] * G[eps][r_kh] * G[nu][r_kw],
                   axis=[r_kh, r_kw],
               ),
               name="my_kernel_pack",
           )
       else:
           kernel_pack = kernel    
       
       idxdiv = tvm.tir.indexdiv
       idxmod = tvm.tir.indexmod
       # pack input tile
       input_tile = te.compute(
           (CI, P, alpha, alpha),
           lambda c, p, eps_1, nu_1: data_pad[idxdiv(p, (nH * nW))][c][
               idxmod(idxdiv(p, nW), nH) * m + eps_1
           ][idxmod(p, nW) * m + nu_1],
           name="my_d",
       )
   
       # dy = tvm.te.placeholder(input_tile.shape, name="input2_dy")
       # [dw] = tvm.te.gradient(input_tile, [data], head=dy)
       # return [data, kernel, input_tile, dy, dw]
   
       # transform data
       r_a = te.reduce_axis((0, alpha), "r_a")
       r_b = te.reduce_axis((0, alpha), "r_b")
       data_pack = te.compute(
           (alpha, alpha, CI, P),
           lambda eps, nu, ci, p: te.sum(
               input_tile[ci][p][r_a][r_b] * B[r_a][eps] * B[r_b][nu],
               axis=[r_a, r_b],
           ),
           name="my_data_pack",
       )
   
       # dy = tvm.te.placeholder(data_pack.shape, name="input2_dy")
       # [dw] = tvm.te.gradient(data_pack, [data], head=dy)
       # return [data, kernel, data_pack, dy, dw]
   
       # do batch gemm
       ci = te.reduce_axis((0, CI), name="ci")
       bgemm = te.compute(
           (alpha, alpha, CO, P),
           lambda eps, nu, co, p: te.sum(
               kernel_pack[eps][nu][ci][co] * data_pack[eps][nu][ci][p],
               axis=[ci],
           ),
           name="my_bgemm",
       )
       # inverse transform
       r_a_2 = te.reduce_axis((0, alpha), "r_a_2")
       r_b_2 = te.reduce_axis((0, alpha), "r_b_2")
       inverse = te.compute(
           (CO, P, m, m),
           lambda co, p, vh, vw: te.sum(
               bgemm[r_a_2][r_b_2][co][p] * A[r_a_2][vh] * A[r_b_2][vw],
               axis=[r_a_2, r_b_2],
           ),
           name="my_inverse",
       )
   
       # output
       output = te.compute(
           (N, CO, H, W),
           lambda n, co, h, w: inverse[
               co, n * nH * nW + idxdiv(h, m) * nW + idxdiv(w, m), idxmod(h, m), idxmod(w, m)
           ],
           name="my_output",
           tag="conv2d_nchw_winograd",
       )
       
       dy = tvm.te.placeholder(output.shape, name="input2_dy")
       [dw] = tvm.te.gradient(output, [data], head=dy)
       return [data, kernel, output,dy,dw]
       # return [data, kernel, output]
   
   target = tvm.target.Target("cuda")
   
   # Use the last layer in ResNet-50
   N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
   task = auto_scheduler.SearchTask(
       func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
   )
   
   # Inspect the computational graph
   print("Computational DAG:")
   print(task.compute_dag)
   
   log_file = "conv2d.json"
   if os.path.exists(log_file):
       os.remove(log_file)
   measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
   tune_option = auto_scheduler.TuningOptions(
       num_measure_trials=10,  # change this to 1000 to achieve the best performance
       runner=measure_ctx.runner,
       measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
       verbose=2,
   )
   
   # Run auto-tuning (search)
   task.tune(tune_option)
   # Apply the best schedule
   sch, args = task.apply_best(log_file)
   
   
   ```
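
As a sanity check on the forward math only (it does not touch `te.gradient` or the auto-scheduler), the transform stages of the workload above can be checked for a single tile and a single channel in plain Python. This is a minimal sketch under stated assumptions: the matrices are the commonly published F(2x2, 3x3) Winograd matrices, written in the `(alpha, m)`, `(alpha, alpha)`, `(alpha, r)` layout the script indexes them in. The matrices returned by `winograd_transform_matrices` may differ while being mathematically equivalent. The helper names `transpose`, `matmul`, `winograd_tile`, and `direct_tile` are hypothetical.

```python
# Assumed F(2x2, 3x3) matrices (alpha=4, m=2, r=3); not taken from TVM.
A = [[1, 0], [1, 1], [1, -1], [0, -1]]                              # (alpha, m)
B = [[1, 0, 0, 0], [0, 1, -1, 1], [-1, 1, 1, 0], [0, 0, 0, -1]]     # (alpha, alpha)
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]       # (alpha, r)


def transpose(x):
    return [list(row) for row in zip(*x)]


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def winograd_tile(d, g):
    """One 4x4 input tile d and one 3x3 kernel g -> 2x2 output tile."""
    U = matmul(matmul(G, g), transpose(G))       # like my_kernel_pack: G g G^T
    V = matmul(matmul(transpose(B), d), B)       # like my_data_pack:   B^T d B
    M = [[u * v for u, v in zip(ur, vr)] for ur, vr in zip(U, V)]  # my_bgemm with CI=1
    return matmul(matmul(transpose(A), M), A)    # like my_inverse:     A^T M A


def direct_tile(d, g):
    """Reference: direct 3x3 correlation producing the same 2x2 tile."""
    return [
        [sum(d[i + kh][j + kw] * g[kh][kw] for kh in range(3) for kw in range(3))
         for j in range(2)]
        for i in range(2)
    ]


d = [[float(4 * i + j) for j in range(4)] for i in range(4)]
g = [[1.0, 2.0, 0.0], [-1.0, 3.0, 1.0], [2.0, 0.0, -2.0]]
out, ref = winograd_tile(d, g), direct_tile(d, g)
assert all(abs(out[i][j] - ref[i][j]) < 1e-9 for i in range(2) for j in range(2))
```

If this check passes, the forward definition is self-consistent for the assumed matrices, which matches the observation that forward-stage codegen works and points the investigation toward how the gradient graph interacts with `cache_read`.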
   

