gnguralnick opened a new issue, #18885:
URL: https://github.com/apache/tvm/issues/18885
## Bug Description
After PR #18865 (`[REFACTOR][TIR] Introduce AllocBuffer and phase out
Allocate+DeclBuffer`), compiling models that use `cumsum` (e.g., via
`gpu_2d_continuous_cumsum`) fails with:
```
tvm.error.InternalError: Check failed: undefined.size() == 0 (1 vs. 0) :
In PrimFunc gpu_2d_continuous_cumsum1 variables [ceil_log2] are used,
but are not passed in as API arguments
```
The error occurs in `make_packed_api.cc` during the final lowering stage.
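To make the error message concrete, here is a toy sketch of this kind of "used but not defined" check. It is not the code in `make_packed_api.cc`; the `Use` and `Let` classes and the `undefined_vars` helper are hypothetical names invented for illustration. The sketch assumes only that a variable counts as defined if it is a function parameter or is bound by an enclosing scope.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass
class Use:
    """A statement that reads the named variables (hypothetical toy node)."""
    names: Tuple[str, ...]


@dataclass
class Let:
    """Binds `name` only for the statements nested in `body` (hypothetical toy node)."""
    name: str
    body: List["Stmt"]


Stmt = Union[Use, Let]


def undefined_vars(params: List[str], body: List[Stmt]) -> Set[str]:
    """Collect variables that are read without being a parameter or bound by an enclosing Let."""
    undefined: Set[str] = set()

    def visit(stmts: List[Stmt], defined: Set[str]) -> None:
        for stmt in stmts:
            if isinstance(stmt, Use):
                undefined.update(n for n in stmt.names if n not in defined)
            else:
                # The binding is visible only inside the nested body.
                visit(stmt.body, defined | {stmt.name})

    visit(body, set(params))
    return undefined


params = ["var_a", "var_out"]

# `ceil_log2` is read with no enclosing binding, so the toy check reports it.
flagged = undefined_vars(params, [Use(("var_a", "ceil_log2"))])

# The same read nested inside a binding of `ceil_log2` is not reported.
clean = undefined_vars(params, [Let("ceil_log2", [Use(("var_a", "ceil_log2"))])])
```

In this toy model, a non-empty result corresponds to the failed `undefined.size() == 0` check in the error above.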
## Root Cause
In `python/tvm/relax/backend/gpu_generic/cumsum.py`, the `cumsum` PrimFunc
(line ~156) has:
```python
@T.prim_func(private=True)
def cumsum(var_a: T.handle, var_out: T.handle):
    T.func_attr({"tir.is_scheduled": True})
    m, n = T.int64(), T.int64()
    A = T.match_buffer(var_a, [m, n], dtype=in_dtype)
    Out = T.match_buffer(var_out, [m, n], dtype=out_dtype)
    Tmp = T.sblock_alloc_buffer([m, n], dtype=out_dtype)  # changed by #18865
    ceil_log2 = T.Cast("int64", T.ceil(T.log2(T.Cast("float32", n))))
    total_rounds = ceil_log2 // LOG_BLOCK_N
    ...
```
PR #18865 changed `T.alloc_buffer` → `T.sblock_alloc_buffer` for `Tmp`.
Previously, `T.alloc_buffer` created an `Allocate(var, ..., DeclBuffer(buf,
body))` node that nested subsequent statements inside its body, so `ceil_log2`
was scoped within the allocation. Now, `T.sblock_alloc_buffer` adds the buffer
to `SBlock.alloc_buffers` and does not create a nesting body, so `ceil_log2`
becomes a top-level `Bind` statement. `make_packed_api` then fails because
`ceil_log2` is used but not recognized as a function parameter.
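For readers unfamiliar with the two scalar bindings involved, the following plain-Python sketch reproduces the arithmetic they express. It is only an approximation: the helper name `cumsum_round_counts` is hypothetical, the sample `log_block_n` values are made up rather than taken from `cumsum.py`, and Python computes in float64 where the TIR snippet casts `n` to float32.

```python
import math
from typing import Tuple


def cumsum_round_counts(n: int, log_block_n: int) -> Tuple[int, int]:
    """Mirror the two scalar bindings from the snippet in plain Python.

    `log_block_n` stands in for LOG_BLOCK_N; its real value is defined in
    cumsum.py and is not assumed here.
    """
    # ceil(log2(n)), computed in float64 (the TIR version uses float32).
    ceil_log2 = math.ceil(math.log2(float(n)))
    # Integer division, as in `ceil_log2 // LOG_BLOCK_N`.
    total_rounds = ceil_log2 // log_block_n
    return ceil_log2, total_rounds


# Sample values chosen purely for illustration.
power_of_two = cumsum_round_counts(1024, 8)
non_power_of_two = cumsum_round_counts(1000, 8)
```

Both values depend on `n`, which is only known at run time, so they cannot be folded into constants at compile time.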
## Steps to Reproduce
Compile any model that triggers `gpu_2d_continuous_cumsum` legalization on a
GPU target (e.g., Metal). For example, compiling a Gemma 3 model with mlc-llm:
```bash
python -m mlc_llm compile <model-path> --device metal
```
## Environment
- TVM commit: d5fd1c7c1 (post-#18865)
- Platform: macOS (Apple Silicon), Metal target
- Discovered via mlc-llm model compilation
## Suggested Fix
The `ceil_log2` and `total_rounds` bindings need to be handled differently
now that `Tmp` uses `sblock_alloc_buffer`. One option is to allocate `Tmp` with
`T.alloc_buffer` (statement-level `AllocBuffer`) instead, which would
restore the body-nesting behavior. Alternatively, `ceil_log2` could be wrapped
in a proper scope.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]