wuyii8941 opened a new issue, #19524:
URL: https://github.com/apache/tvm/issues/19524

   # [Bug][Relax] `conv2d_transpose` produces wrong results on CUDA when `output_padding > 0`
   
   ## Reproduction
   
   ```python
   import numpy as np
   import tvm
   from tvm import relax
   import tvm.relax.op as R
   from tvm.s_tir import dlight
   
   bb = relax.BlockBuilder()
   x = relax.Var("x", relax.TensorStructInfo((1, 16, 7, 7), "float32"))
   w = relax.Var("w", relax.TensorStructInfo((16, 3, 3, 3), "float32"))
   with bb.function("main", [x, w]):
       with bb.dataflow():
           out = bb.emit(R.nn.conv2d_transpose(x, w,
               strides=(2, 2), padding=(1, 1), output_padding=(1, 1)))
           gv = bb.emit_output(out)
       bb.emit_func_output(gv)
   mod = bb.get()
   
   np.random.seed(0)
   x_np = np.random.randn(1, 16, 7, 7).astype("float32")
   w_np = np.random.randn(16, 3, 3, 3).astype("float32") * 0.01
   
   # CPU
   pipeline_cpu = tvm.ir.transform.Sequential([relax.transform.LegalizeOps()])
   exe_cpu = tvm.relax.build(pipeline_cpu(mod), target="llvm")
   vm_cpu = tvm.relax.VirtualMachine(exe_cpu, device=tvm.cpu())
   out_cpu = vm_cpu["main"](
       tvm.runtime.tensor(x_np, device=tvm.cpu()),
       tvm.runtime.tensor(w_np, device=tvm.cpu())).numpy()
   
   # CUDA
   pipeline_cuda = tvm.ir.transform.Sequential([
       relax.transform.LegalizeOps(),
       dlight.ApplyDefaultSchedule(dlight.gpu.Fallback()),
   ])
   with tvm.target.Target("cuda"):
       mod_cuda = pipeline_cuda(mod)
   exe_cuda = tvm.relax.build(mod_cuda, target="cuda")
   vm_cuda = tvm.relax.VirtualMachine(exe_cuda, device=tvm.cuda())
   out_cuda = vm_cuda["main"](
       tvm.runtime.tensor(x_np, device=tvm.cuda()),
       tvm.runtime.tensor(w_np, device=tvm.cuda())).numpy()
   
   print(f"max |CPU - CUDA| = {np.abs(out_cpu - out_cuda).max():.6e}")
   # Expected: ~1e-7 (float rounding)
   # Actual:   ~3e-1 (wrong result)
   ```
   
   ## Expected behavior
   
   `conv2d_transpose` should produce the same result on CPU and CUDA (within 
floating-point tolerance).
   
   ## Actual behavior
   
   CUDA output is silently wrong. The maximum absolute difference is ~0.3 
(relative error >100%). Comparing against PyTorch `F.conv_transpose2d` as 
ground truth confirms this: CPU matches PyTorch to within float32 rounding, 
while CUDA does not.
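
To narrow down where the two outputs disagree, it can help to report which output rows and columns contain mismatches rather than only the maximum difference. The following is a minimal sketch: `summarize_mismatch` is a hypothetical helper, the tolerances are arbitrary choices, and the demo arrays are synthetic stand-ins, not the outputs from this issue.

```python
import numpy as np

def summarize_mismatch(ref, out, rtol=1e-5, atol=1e-5):
    """Report how large the disagreement is and where it sits in an NCHW tensor."""
    diff = np.abs(ref - out)
    # Same criterion as np.allclose: an element is bad if it exceeds the mixed tolerance.
    bad = diff > (atol + rtol * np.abs(ref))
    idx = np.nonzero(bad)
    return {
        "max_abs": float(diff.max()),
        "bad_fraction": float(bad.mean()),
        "bad_rows": np.unique(idx[2]).tolist(),  # H indices with at least one mismatch
        "bad_cols": np.unique(idx[3]).tolist(),  # W indices with at least one mismatch
    }

# Synthetic stand-in data (NOT results from this issue): corrupt only the last row.
ref = np.zeros((1, 3, 4, 4), dtype="float32")
out = ref.copy()
out[:, :, -1, :] += 0.3
report = summarize_mismatch(ref, out)
print(report)
```

With the reproduction above, the call would be `summarize_mismatch(out_cpu, out_cuda)`.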
   
   ## Trigger condition
   
   The bug requires `output_padding > 0`. Without `output_padding`, all spatial 
sizes work correctly.
   
   | Input H×W | output_padding | Max abs diff (CPU vs CUDA) | Result |
   |:---------:|:--------------:|:----------------:|:------:|
   | 4×4       | (1,1)          | 1.5e-8           | OK     |
   | 5×5       | (1,1)          | 2.8e-1           | **Wrong** |
   | 6×6       | (1,1)          | 3.5e-1           | **Wrong** |
   | 7×7       | (1,1)          | 3.6e-1           | **Wrong** |
   | 8×8       | (1,1)          | 3.0e-8           | OK     |
   | 9×9       | (1,1)          | 3.0e-1           | **Wrong** |
   | 10×10     | (1,1)          | 3.0e-8           | OK     |
   | 7×7       | (0,0)          | 2.2e-8           | OK     |
   
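   Pattern: in the table above, **every size that is not a multiple of the 
stride fails** (H % stride != 0, with stride 2), and only when 
`output_padding > 0`. Divisibility alone does not explain every failure, 
though: 6×6 also fails even though 6 % 2 == 0.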
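
For reference, the output extent and the remainder for each tested size can be computed with the standard transposed-convolution size formula (the one documented for PyTorch's `ConvTranspose2d`). This is a minimal sketch, assuming the 3×3 kernel, stride 2, and padding 1 from the reproduction and dilation 1; `conv_transpose_out_size` is a hypothetical helper name.

```python
def conv_transpose_out_size(size, kernel, stride, padding, output_padding, dilation=1):
    """Output extent along one spatial axis of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1

# Settings taken from the reproduction: 3x3 kernel, stride 2, padding 1.
KERNEL, STRIDE, PADDING = 3, 2, 1

# (input size, output_padding) pairs from the table in this report.
cases = [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (7, 0)]
for size, out_pad in cases:
    out = conv_transpose_out_size(size, KERNEL, STRIDE, PADDING, out_pad)
    print(f"H={size:2d} output_padding={out_pad} -> H_out={out:2d}, H % stride = {size % STRIDE}")
```

The script only restates the shapes involved; it does not identify the cause of the mismatch.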
   
   ## Verification against PyTorch
   
   ```
   CPU  vs PyTorch: abs=4.47e-8  (match)
   CUDA vs PyTorch: abs=3.34e-1  (wrong)
   ```
   
   CPU matches PyTorch to within float32 rounding. CUDA is the faulty target.
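
For a ground truth that does not depend on PyTorch, a naive NumPy reference could serve the same purpose. This is a minimal sketch, assuming NCHW data, IOHW weights, `groups=1`, and dilation 1 (the layout that the shapes in the reproduction suggest); `conv2d_transpose_ref` is a hypothetical helper, not part of TVM or PyTorch.

```python
import numpy as np

def conv2d_transpose_ref(x, w, stride=(1, 1), padding=(0, 0), output_padding=(0, 0)):
    """Naive transposed convolution: NCHW data, IOHW weight, groups=1, dilation=1."""
    n, c_in, h, w_in = x.shape
    c_in_w, c_out, kh, kw = w.shape
    assert c_in == c_in_w, "input channels of data and weight must agree"
    (sh, sw), (ph, pw), (oph, opw) = stride, padding, output_padding
    h_out = (h - 1) * sh - 2 * ph + kh + oph
    w_out = (w_in - 1) * sw - 2 * pw + kw + opw
    # Uncropped buffer; the extra output_padding rows/columns stay zero
    # unless the scatter below reaches them.
    full = np.zeros(
        (n, c_out, (h - 1) * sh + kh + oph, (w_in - 1) * sw + kw + opw), dtype=x.dtype
    )
    for i in range(h):
        for j in range(w_in):
            # Each input pixel scatters a weighted copy of the kernel into the output.
            full[:, :, i * sh:i * sh + kh, j * sw:j * sw + kw] += np.einsum(
                "nc,cokl->nokl", x[:, :, i, j], w
            )
    # Crop the leading padding and keep exactly h_out x w_out elements.
    return full[:, :, ph:ph + h_out, pw:pw + w_out]

# Tiny all-ones check that can be verified by hand.
x = np.ones((1, 1, 2, 2), dtype="float32")
k = np.ones((1, 1, 3, 3), dtype="float32")
y = conv2d_transpose_ref(x, k, stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
print(y.shape)
```

With the arrays from the reproduction, `conv2d_transpose_ref(x_np, w_np, (2, 2), (1, 1), (1, 1))` could be compared against both `out_cpu` and `out_cuda`.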
   
   ## Environment
   
   - TVM: main branch (commit `0b0afd8dd`)
   - Target: `cuda`
   - Python: 3.11
   - OS: Ubuntu Linux
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

