wrongtest-intellif opened a new pull request, #12174:
URL: https://github.com/apache/tvm/pull/12174

   Hi there, this PR introduces a new TIR schedule primitive 
`sch.decompose_padding(block, loop)`.
   
   For padded `conv` or `pooling` ops, there is a typical padding pattern:
   
   `Pad[_] = T.if_then_else(pad_predicate, X[_], pad_value)` 
   
   which could be decomposed into two parts:
   - One filling pad values with `memset` routine
   - One filling in-bound values with `memcpy` routine
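   The decomposition above can be illustrated with a pure-Python sketch (not TVM code; `pad_fused`, `pad_decomposed`, and `off` are hypothetical names for illustration): the fused loop evaluates a condition per element, while the decomposed form is a constant fill followed by an unconditional slice copy.

   ```python
   # Hypothetical pure-Python illustration of splitting the pattern
   #   Pad[i] = if_then_else(pred, X[i - off], pad_value)
   # into a memset-like fill and a memcpy-like copy.

   def pad_fused(x, total, off, pad_value=0):
       # Single loop with a per-element padding condition (the original pattern).
       return [x[i - off] if off <= i < off + len(x) else pad_value
               for i in range(total)]

   def pad_decomposed(x, total, off, pad_value=0):
       # Step 1: "memset" -- fill the whole output with the pad value.
       y = [pad_value] * total
       # Step 2: "memcpy" -- copy the in-bound region, no condition needed.
       y[off:off + len(x)] = x
       return y

   x = list(range(1, 9))
   assert pad_fused(x, 12, 2) == pad_decomposed(x, 12, 2)
   ```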
   
   The primitive's signature is similar to `decompose_reduction`: it takes a 
target block and a loop position at which to insert the newly created "init" 
block. It is helpful on infrastructures with high-performance memset/memcpy 
routines, since it relieves the main compute block of the complexity of 
handling padding conditions.
   
   ### Example
   - Before
   ```python
   @T.prim_func
   def before_decompose(x: T.Buffer[128, "int32"], y: T.Buffer[140, "int32"]):
       for i in range(140):
           with T.block("block"):
               vi = T.axis.remap("S", [i])
               y[vi] = T.if_then_else(vi >= 6 and vi < 134, x[vi - 6], 0, dtype="int32")
   ```
   
   - After  `decompose_padding(block, i)`
   ```python
   @T.prim_func
   def after_decompose(x: T.Buffer[128, "int32"], y: T.Buffer[140, "int32"]):
       for i in T.serial(140):
           with T.block("block_pad_const"):
               vi = T.axis.spatial(140, i)
               y[vi] = 0
       for i in T.serial(128):
           with T.block("block"):
               vi = T.axis.spatial(128, i)
               y[vi + 6] = x[vi]
   ```
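   To see that the transformation preserves semantics, the two TVMScript functions above can be emulated in plain Python (an illustrative sketch, not TVM code) and compared:

   ```python
   # Pure-Python emulation of the before/after prim_funcs above,
   # checking that both compute the same y.

   def before_decompose(x):
       y = [None] * 140
       for vi in range(140):
           # Original fused loop with the padding condition.
           y[vi] = x[vi - 6] if 6 <= vi < 134 else 0
       return y

   def after_decompose(x):
       y = [None] * 140
       for vi in range(140):      # "block_pad_const": memset-like fill
           y[vi] = 0
       for vi in range(128):      # "block": memcpy-like copy
           y[vi + 6] = x[vi]
       return y

   x = list(range(128))
   assert before_decompose(x) == after_decompose(x)
   ```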
   
   ### Alternatives and drawbacks
   - From the graph perspective, one may be able to fold out the block that 
pads the input buffer entirely. However, the primitive is more useful when one 
wants to perform padding on intra-primfunc buffers.
   
   - One could also `compute-inline` the block that performs padding, but this 
introduces conditions into the main computation block, which may or may not 
get optimized well depending on the concrete target.
   
   - Currently there are scheduling limitations on the blocks created after 
decomposition: they cannot subsequently be `compute-at`ed or `compute-inline`d, 
because multiple blocks writing to the same buffer break the stage pipeline 
property.

