Lunderberg commented on PR #16938:
URL: https://github.com/apache/tvm/pull/16938#issuecomment-2080617986

   Agreed, there's a tradeoff, and replacing copies with views isn't something 
to be applied in every case.
   
   * Kernel Compatibility
   
     Currently, the default `tir::Buffer` used by TE (used by topi, and used in 
most Relax legalizations) is defined with `elem_offset` of zero.
     
     * Option 1: Change TE's default to use arbitrary `elem_offset` for the 
default `tir::Buffer`.  (Change the default of `elem_offset` from zero to a 
fresh TIR variable.)  Not a good option, as this would provide a weaker 
alignment check that existing kernels may assume.
     
     * Option 2: Change TE's default to use a dynamic `elem_offset` that 
preserves alignment for the default `tir::Buffer`. (Change the default of 
`elem_offset` from zero to a fresh TIR variable, and change the default of 
`offset_factor` to `kAllocAlignment // dtype.bytes()`.)  This is better than 
option (1), because `data + elem_offset*dtype.bytes()` is still aligned.  
However, still not a good option, as TIR kernels that call external functions 
typically pass `buf.data` as an argument, which doesn't include the 
`elem_offset`.
     
     * Option 3: Keep TE's default the same, and instead apply a 
post-processing step.  For each `tir::Buffer` argument with `elem_offset` of 
zero, replace it with a new `tir::Buffer` with `elem_offset` allowed, and 
`offset_factor` to maintain alignment.  Within the body of the function, add a 
definition of the old `tir::Buffer` argument either as a `MatchBufferRegion` or 
a `attr::buffer_bind_scope`.
   
       I'm liking Option 3 the most, as it adds flexibility at the calling 
scope.  Unlike option 1, it still provides the same alignment guarantees for 
kernels.  Unlike option 2, it wouldn't require re-writing a very large number 
of existing kernels.
     
   * Performance
   
       * Zero-Copy vs Contiguous Arrays:  The `CreateView` method is only 
allowed to make contiguous views of contiguous arrays.  It is not able to make 
strided views, so this tradeoff isn't applicable.
   
       * Zero-Copy vs Alignment: The `CreateView` method can make views whose 
`data + byte_offset` isn't aligned to 
   `kAllocAlignment`.  The restriction of aligned `data + byte_offset` would 
still be applied at the `PrimFunc` level, so this wouldn't cause a performance 
hit.
   
       * Zero-Copy vs Memory Footprint: The `NDArray` returned by `CreateView` 
shares the same reference-counted allocation as the source array into which it 
views.  If the view outlives the source, then the memory footprint of the view 
is larger than a copy would have been.
       
       Not an issue for the `CreateView` method itself, but should be a 
consideration for optimization passes that replace copies with views.  (e.g. 
replacing `R.take`, `R.split`, or `R.strided_slice` with a view)  Should only 
be done in cases where the source array outlives the view.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to