wuyii8941 opened a new issue, #19520:
URL: https://github.com/apache/tvm/issues/19520

   
   ## Reproduction
   
   ```python
   import tvm
   from tvm import relax
   import tvm.relax.op as R
   from tvm.s_tir import dlight
   
   bb = relax.BlockBuilder()
   x = relax.Var("x", relax.TensorStructInfo((1, 3, 8, 8), "float32"))
   with bb.function("main", [x]):
       with bb.dataflow():
           out = bb.emit(R.nn.adaptive_avg_pool2d(x, output_size=(3, 3)))  # 8 % 3 != 0
           gv = bb.emit_output(out)
       bb.emit_func_output(gv)
   mod = bb.get()
   
   pipeline = tvm.ir.transform.Sequential([
       relax.transform.LegalizeOps(),
       dlight.ApplyDefaultSchedule(dlight.gpu.Fallback()),
   ])
   with tvm.target.Target("cuda"):
       mod_cuda = pipeline(mod)
   
   tvm.relax.build(mod_cuda, target="cuda")  # RuntimeError
   ```
   
   ## Error message
   
   ```
   RuntimeError: Memory verification failed with the following errors:
       Variable `x` is directly accessed by host memory
       (it is not contained in a thread environment or in the function arguments)
   ```
   
   ## Expected behavior
   
   `adaptive_avg_pool2d` should compile and run on CUDA for any valid output size, as it does on CPU (`llvm`).
   
   ## Trigger condition
   
   The crash occurs when the output size does not evenly divide the input spatial dimensions. Even divisions compile and run correctly.
   
   | Input spatial (H×W) | Output size (oh×ow) | H % oh == 0? | CUDA |
   |:---:|:---:|:---:|:---:|
   | 8×8 | 4×4 | Yes | OK |
   | 8×8 | 2×2 | Yes | OK |
   | 8×8 | 1×1 | Yes | OK |
   | 8×8 | 3×3 | No | **Crash** |
   | 8×8 | 5×5 | No | **Crash** |
   | 16×16 | 3×3 | No | **Crash** |
   | 7×7 | 3×3 | No | **Crash** |
   | 7×7 | 1×1 | Yes | OK |
   
   All cases above compile and run correctly on CPU (`llvm`).
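
To compare CPU and CUDA outputs numerically once a build succeeds, a small NumPy reference can help. This is a minimal sketch, not TVM code: the function name `adaptive_avg_pool2d_ref` is hypothetical, and the floor/ceil window bounds are the commonly used adaptive-pooling convention, assumed here rather than taken from the issue.

```python
import numpy as np


def adaptive_avg_pool2d_ref(x: np.ndarray, output_size: tuple[int, int]) -> np.ndarray:
    """Reference adaptive average pooling over an NCHW array (hypothetical helper)."""
    n, c, h, w = x.shape
    oh, ow = output_size
    out = np.empty((n, c, oh, ow), dtype=x.dtype)
    for i in range(oh):
        # Assumed convention: start = floor(i*H/oh), end = ceil((i+1)*H/oh)
        h0, h1 = (i * h) // oh, ((i + 1) * h + oh - 1) // oh
        for j in range(ow):
            w0, w1 = (j * w) // ow, ((j + 1) * w + ow - 1) // ow
            # Average the window for every batch and channel at once
            out[:, :, i, j] = x[:, :, h0:h1, w0:w1].mean(axis=(2, 3))
    return out


x = np.arange(1 * 3 * 8 * 8, dtype="float32").reshape(1, 3, 8, 8)
ref = adaptive_avg_pool2d_ref(x, (3, 3))  # same shapes as the reproduction
print(ref.shape)
```

The result could then be checked against the `llvm` output with `np.testing.assert_allclose`, and against the CUDA output once the build failure is resolved.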
   
   ## Analysis
   
   When the pool window is non-uniform (i.e., different output positions require different window sizes), the TIR produced by `topi.nn.adaptive_pool` contains conditional branches with variable-extent reduction loops. After `dlight.gpu.Fallback()` scheduling, these loops are not bound to GPU threads, causing the memory verifier to reject the generated code.
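
To make the non-uniform windows concrete, the sketch below lists the input range each output index reads along one axis. It assumes the usual floor/ceil bounds for adaptive pooling; the helper name `adaptive_windows` is hypothetical and the code does not call into TVM.

```python
def adaptive_windows(in_size: int, out_size: int) -> list[tuple[int, int]]:
    """Half-open [start, end) input range read by each output index (assumed convention)."""
    return [
        # start = floor(i * in / out), end = ceil((i + 1) * in / out), in integer arithmetic
        ((i * in_size) // out_size, ((i + 1) * in_size + out_size - 1) // out_size)
        for i in range(out_size)
    ]


print(adaptive_windows(8, 4))  # [(0, 2), (2, 4), (4, 6), (6, 8)]
print(adaptive_windows(8, 3))  # [(0, 3), (2, 6), (5, 8)]
```

For 8 to 4 every window has extent 2, so the reduction loop has a constant bound. For 8 to 3 the extents are 3, 4, 3, so the bound depends on the output index. Under this convention 7 to 3 yields equal-sized windows, yet the table lists it as crashing, so the divisibility check appears to be the more reliable predictor of the failure than window size alone.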
   
   ## Environment
   
   - TVM: main branch (commit `0b0afd8dd`)
   - Target: `cuda`
   - Python: 3.11
   - OS: Ubuntu Linux
   



