junrushao1994 commented on PR #12019:
URL: https://github.com/apache/tvm/pull/12019#issuecomment-1176764243

   Moving our discussion thread from #11428. Thanks @ArmageddonKnight for 
spotting this nasty corner case, and @comaniac for offline discussion :-)
   
   @junrushao1994:
   
   > This PR introduces a regression with those lines: 
https://github.com/apache/tvm/blob/d4be49aec62299275565066b56a0555bafc2ccac/src/tir/transforms/compact_buffer_region.cc#L77-L85
   > 
   > Consider a case below:
   > 
   > ```python
   > extent = 960 - max(bx // 18 * 128, 832)
   > where bx \in [0, 144)
   > ```
   > 
   > Even if `bx` exists in `extent`, the upper bound of the allocation is not 
affected (which is 128). However, the check there introduces a false positive 
which falls back to a much more conservative estimate ([full 
region](https://github.com/apache/tvm/blob/d4be49aec62299275565066b56a0555bafc2ccac/src/tir/transforms/compact_buffer_region.cc#L77-L85)).
   
   @junrushao1994 
   
   > A minimal reproducible example: 
https://gist.github.com/junrushao1994/943eb0e9232a21a2748eb35af0fac4ff
   > 
   > The region of `A_shared` should be [128, 4] instead of [192, 4]
   
   @junrushao1994 
   
   > so let me briefly introduce the workflow of Compute-Buffer-Region first, 
and then let’s think about how to make it perfectly fit in our usecase.
   > 
   > On each use-site of a buffer, we calculate the region it is touched by the 
loops above it, and then by unionizing those regions, it’s made possible to 
trim out those regions that are not touched.
   > 
   > As a simplest example, if the original size of a buffer in shared memory 
is `[1024, 1024]`, but only `[bx : bx + 10, by : by + 20]` is touched in a 
threadblock, then the buffer size could shrink to `[10, 20]` which is 
significantly smaller.
   > 
   > Then to provide a slightly more complicated example, if the buffer is 
touched in two different places, where the region are `[bx : bx + 10 : by : by 
+ 20]`, `[bx + 10: bx + 20, by : by + 30]`, then we calculate the union of the 
two regions, i.e. `[bx : bx + 20, by : by + 30]`, and then do shrinking so that 
its size of `[20, 30]`.
   > 
   > The problem this PR aims to address is that some buffer’s size actually 
depends on irrelevant outer variables, for example, the shared memory size to 
be `[0, 960 - max(bx // 18 * 128, 832)]`, where `bx` is the extent of 
`blockIdx.x`. However, it takes an approach to over-relax the region when 
seeing any undesired variables (in our case, it’s `bx`), which is less optimal.
   > 
   > The actual solution should be: calculating the region union bound first, 
and then remove the undesired variables with something similar to analyzer's 
`const_int_bound` (note that we don’t want the bound to be strictly constant)
   
   @wrongtest
   
   > > calculating the region union bound first
   >
   > Many thanks!  It indeed should check loop dependency for union region. 
   > 
   > https://gist.github.com/junrushao1994/943eb0e9232a21a2748eb35af0fac4ff
   > In the case it seems we get extent=128 for `A_shared[128, 4]` after fix, 
even without const bound estimation... Perhaps the loop dependent region parts 
get covered during region unions?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to