wuyii8941 opened a new issue, #19524:
URL: https://github.com/apache/tvm/issues/19524
# [Bug][Relax] `conv2d_transpose` produces wrong results on CUDA when `output_padding > 0`
## Reproduction
```python
import numpy as np
import tvm
from tvm import relax
import tvm.relax.op as R
from tvm.s_tir import dlight

# Build a Relax module containing a single conv2d_transpose call.
bb = relax.BlockBuilder()
x = relax.Var("x", relax.TensorStructInfo((1, 16, 7, 7), "float32"))
w = relax.Var("w", relax.TensorStructInfo((16, 3, 3, 3), "float32"))
with bb.function("main", [x, w]):
    with bb.dataflow():
        out = bb.emit(R.nn.conv2d_transpose(
            x, w, strides=(2, 2), padding=(1, 1), output_padding=(1, 1)))
        gv = bb.emit_output(out)
    bb.emit_func_output(gv)
mod = bb.get()

# Fixed seed so both targets see identical inputs.
np.random.seed(0)
x_np = np.random.randn(1, 16, 7, 7).astype("float32")
w_np = np.random.randn(16, 3, 3, 3).astype("float32") * 0.01

# CPU
pipeline_cpu = tvm.ir.transform.Sequential([relax.transform.LegalizeOps()])
exe_cpu = tvm.relax.build(pipeline_cpu(mod), target="llvm")
vm_cpu = tvm.relax.VirtualMachine(exe_cpu, device=tvm.cpu())
out_cpu = vm_cpu["main"](
    tvm.runtime.tensor(x_np, device=tvm.cpu()),
    tvm.runtime.tensor(w_np, device=tvm.cpu())).numpy()

# CUDA
pipeline_cuda = tvm.ir.transform.Sequential([
    relax.transform.LegalizeOps(),
    dlight.ApplyDefaultSchedule(dlight.gpu.Fallback()),
])
with tvm.target.Target("cuda"):
    mod_cuda = pipeline_cuda(mod)
exe_cuda = tvm.relax.build(mod_cuda, target="cuda")
vm_cuda = tvm.relax.VirtualMachine(exe_cuda, device=tvm.cuda())
out_cuda = vm_cuda["main"](
    tvm.runtime.tensor(x_np, device=tvm.cuda()),
    tvm.runtime.tensor(w_np, device=tvm.cuda())).numpy()

print(f"max |CPU - CUDA| = {np.abs(out_cpu - out_cuda).max():.6e}")
# Expected: ~1e-7 (float rounding)
# Actual: ~3e-1 (wrong result)
```
## Expected behavior
`conv2d_transpose` should produce the same result on CPU and CUDA (within
floating-point tolerance).
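To make "within floating-point tolerance" concrete, the check could be expressed as a small NumPy helper. This is only a sketch: the name `compare_outputs` and the `rtol`/`atol` defaults of `1e-5` are assumptions chosen for illustration, not values used anywhere in TVM.

```python
import numpy as np


def compare_outputs(ref, out, rtol=1e-5, atol=1e-5):
    """Return (max_abs_diff, max_rel_diff, ok) for two arrays of equal shape.

    Hypothetical helper; the tolerances are illustrative assumptions.
    """
    ref = np.asarray(ref, dtype=np.float64)
    out = np.asarray(out, dtype=np.float64)
    abs_diff = np.abs(ref - out)
    # Guard the denominator so exact zeros in the reference do not divide by zero.
    rel_diff = abs_diff / np.maximum(np.abs(ref), np.finfo(np.float32).tiny)
    # np.allclose tests |out - ref| <= atol + rtol * |ref| element-wise.
    ok = bool(np.allclose(out, ref, rtol=rtol, atol=atol))
    return float(abs_diff.max()), float(rel_diff.max()), ok
```

With the arrays from the reproduction script, `compare_outputs(out_cpu, out_cuda)` would be expected to report `ok=True` on a correct build.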
## Actual behavior
CUDA output is wrong, and no error or warning is raised. The maximum absolute
difference is ~0.3 (relative error >100%). Comparing against PyTorch
`F.conv_transpose2d` as ground truth confirms this: CPU matches PyTorch to
within float32 rounding, CUDA does not.
## Trigger condition
The bug requires `output_padding > 0`. Without `output_padding`, all spatial
sizes work correctly.
| Input H×W | output_padding | CPU vs CUDA diff | Result |
|:---------:|:--------------:|:----------------:|:------:|
| 4×4 | (1,1) | 1.5e-8 | OK |
| 5×5 | (1,1) | 2.8e-1 | **Wrong** |
| 6×6 | (1,1) | 3.5e-1 | **Wrong** |
| 7×7 | (1,1) | 3.6e-1 | **Wrong** |
| 8×8 | (1,1) | 3.0e-8 | OK |
| 9×9 | (1,1) | 3.0e-1 | **Wrong** |
| 10×10 | (1,1) | 3.0e-8 | OK |
| 7×7 | (0,0) | 2.2e-8 | OK |
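
Pattern: **failures occur only when `output_padding > 0`**. With stride 2,
every odd size tested (5, 7, 9) fails, which points at `H % stride != 0`.
However, 6×6 also fails while 4×4, 8×8 and 10×10 pass, so that condition alone
does not explain every row of the table.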
## Verification against PyTorch
```
CPU vs PyTorch: abs=4.47e-8 (match)
CUDA vs PyTorch: abs=3.34e-1 (wrong)
```
CPU matches PyTorch to within float32 rounding. CUDA is the faulty target.
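For a ground truth that depends on neither TVM nor PyTorch, a scatter-style reference can be sketched in plain NumPy. The function name `conv2d_transpose_ref` is hypothetical, and the sketch assumes NCHW data, IOHW weights, `groups=1`, `dilation=1` and symmetric padding, which is how the reproduction script above is set up. It is meant for checking results, not as a description of how TVM lowers the op.

```python
import numpy as np


def conv2d_transpose_ref(x, w, strides=(1, 1), padding=(0, 0), output_padding=(0, 0)):
    """Reference 2-D transposed convolution (hypothetical helper).

    Assumes NCHW data, IOHW weights, groups=1, dilation=1, symmetric padding.
    """
    n, c_in, h, wd = x.shape
    c_in_w, c_out, kh, kw = w.shape
    assert c_in == c_in_w, "input channels of x and w must match"
    sh, sw = strides
    ph, pw = padding
    oph, opw = output_padding

    # Output size: (in - 1) * stride - 2 * pad + kernel + output_padding.
    out_h = (h - 1) * sh - 2 * ph + kh + oph
    out_w = (wd - 1) * sw - 2 * pw + kw + opw

    # Uncropped buffer, extended by output_padding on the bottom/right edge.
    full = np.zeros(
        (n, c_out, (h - 1) * sh + kh + oph, (wd - 1) * sw + kw + opw),
        dtype=np.float64,
    )
    x64 = x.astype(np.float64)
    w64 = w.astype(np.float64)
    for i in range(h):
        for j in range(wd):
            # Each input pixel scatters a (c_out, kh, kw) patch into the output.
            patch = np.einsum("nc,cokl->nokl", x64[:, :, i, j], w64)
            full[:, :, i * sh:i * sh + kh, j * sw:j * sw + kw] += patch

    # Crop `padding` rows/columns from the top/left; the window length handles the rest.
    return full[:, :, ph:ph + out_h, pw:pw + out_w].astype(x.dtype)
```

Calling `conv2d_transpose_ref(x_np, w_np, strides=(2, 2), padding=(1, 1), output_padding=(1, 1))` on the inputs from the reproduction script yields an array of shape `(1, 3, 14, 14)` that can be compared against both `out_cpu` and `out_cuda`.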
## Environment
- TVM: main branch (commit `0b0afd8dd`)
- Target: `cuda`
- Python: 3.11
- OS: Ubuntu Linux