quic-sanirudh commented on PR #14161:
URL: https://github.com/apache/tvm/pull/14161#issuecomment-1451610544

   > Hi @quic-sanirudh , this schedule allocates a buffer with the same volume 
as the Cartesian product of block itervar domains, regardless of the original 
buffer size, and will not perform the check because those data are not used in 
current block.
   
   Thanks for the reply @yzh119. If it does not perform correctness checks and 
generates the loop based on the surrounding itervar domains, then couldn't that 
lead to potentially incorrect code?
   
   For example, applying `reindex_cache_read` to the following modification of 
your initial example would probably lead to incorrect code:
   
   ```python
   from tvm.script import tir as T
   
   @T.prim_func
   def func(a: T.handle, b: T.handle, c: T.handle) -> None:
       A = T.match_buffer(a, (129, 129))
       B = T.match_buffer(b, (128, 128))
       C = T.match_buffer(c, (129, 129))
       for i, j in T.grid(128, 128):
           with T.block("B"):
               vi, vj = T.axis.remap("SS", [i, j])
               B[vi, vj] = A[vi + 1, vj + 1] * 2.0
       for i, j in T.grid(129, 129):
           with T.block("C"):
               vi, vj = T.axis.remap("SS", [i, j])
               C[vi, vj] = A[vi, vj] * 3.0
   ```
   
   When we apply 
`sch.reindex_cache_read("B", 0, "shared", lambda i, j: (j, i))`, 
we get the following IR:
   
   ```python
   # from tvm.script import ir as I
   # from tvm.script import tir as T
   
   @I.ir_module
   class Module:
       @T.prim_func
       def main(A: T.Buffer((129, 129), "float32"), B: T.Buffer((128, 128), 
"float32"), C: T.Buffer((129, 129), "float32")):
           # with T.block("root"):
           A_shared = T.alloc_buffer((128, 128), scope="shared")
           for i, j in T.grid(128, 128):
               with T.block("A_shared"):
                   vi, vj = T.axis.remap("SS", [i, j])
                   T.reads(A[vi + 1, vj + 1])
                   T.writes(A_shared[vj, vi])
                   A_shared[vj, vi] = A[vi + 1, vj + 1]
           for i, j in T.grid(128, 128):
               with T.block("B"):
                   vi, vj = T.axis.remap("SS", [i, j])
                   T.reads(A_shared[vj, vi])
                   T.writes(B[vi, vj])
                   B[vi, vj] = A_shared[vj, vi] * T.float32(2)
           vj_1 = T.int32()
           vi_1 = T.int32()
           for i, j in T.grid(129, 129):
               with T.block("C"):
                   vi, vj = T.axis.remap("SS", [i, j])
                   T.reads(A_shared[vj_1, vi_1])
                   T.writes(C[vi, vj])
                   C[vi, vj] = A_shared[vj_1, vi_1] * T.float32(3)
   ```
   
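   Note that the iteration domain of block `"A_shared"` is `(128, 128)`, so the 
cache only holds `A[1:129, 1:129]`, but block `"C"` iterates over `(129, 129)` 
and reads all of `A`. Row 0 and column 0 of `A` are never copied into 
`A_shared`, which means the values `C[0, j]` for `0 <= j <= 128` and `C[i, 0]` 
for `0 <= i <= 128` would be invalid. The rewritten block `"C"` also indexes 
`A_shared` with `vj_1` and `vi_1`, which are not bound to any loop or block 
iterator.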
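
To make the gap concrete, here is a minimal sketch in plain Python (not TVM's region analysis) that enumerates which elements of `A` the cache stage copies and which elements each consumer reads. The helper name `is_covered` and the set-based model are assumptions made for illustration only; they are not part of the schedule primitive.

```python
# Illustrative model only: index sets stand in for buffer access regions.
N = 128  # extent of block "B" and of the generated "A_shared" block

# Elements of A copied by the cache stage: A[vi + 1, vj + 1], vi, vj in [0, N).
cached = {(vi + 1, vj + 1) for vi in range(N) for vj in range(N)}

# Elements of A read by block "B": identical to what the cache stage copies.
needed_by_b = {(vi + 1, vj + 1) for vi in range(N) for vj in range(N)}

# Elements of A read by block "C": A[vi, vj], vi, vj in [0, N + 1).
needed_by_c = {(vi, vj) for vi in range(N + 1) for vj in range(N + 1)}


def is_covered(needed, available):
    """Hypothetical check: True when every element read is present in the cache."""
    return needed <= available


# Elements block "C" needs that the cache never receives.
missing = needed_by_c - cached

print(is_covered(needed_by_b, cached))  # block "B" is fully served by the cache
print(is_covered(needed_by_c, cached))  # block "C" is not
print(len(missing))                     # size of row 0 plus column 0 of A
```

A containment check along these lines, applied to every consumer of the buffer, is one way a primitive could decide which blocks are safe to redirect to the cache.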

