junrushao opened a new pull request, #15252:
URL: https://github.com/apache/tvm/pull/15252

   This PR improves the Decode-GEMV scheduling by further analyzing its 
epilogue pattern.
   
   The existing behavior assumes that the outcome of cross-thread reduction 
stays in the register file local to each thread and is then used to 
calculate the epilogue in that same thread.
   
   This strategy means the cross-thread reduction outcome is stored only on 
thread 0, while the other threads cannot participate in subsequent computation 
(i.e. epilogue). Related: https://github.com/apache/tvm/pull/15192.
   
   When the epilogue is relatively lightweight, e.g. an elementwise add or a 
cast on scalars, this strategy is optimal. However, once the outcome needs to be 
broadcast to compute over a non-trivial region, for example, to act as a 
normalizer as in `np.mean`, it becomes much slower because only one thread in 
a thread block is effectively used.
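
   As a rough illustration of the cost, the sketch below counts the sequential 
epilogue steps on the busiest thread under both strategies. It is a simplified 
model, not TVM code: the function name `epilogue_steps` and the sizes used are 
hypothetical, and memory latency is ignored.

```python
def epilogue_steps(region_size: int, num_threads: int, all_threads: bool) -> int:
    """Return the number of sequential epilogue steps on the busiest thread.

    Hypothetical cost model: one step per output element, no memory latency.
    """
    if not all_threads:
        # Only thread 0 holds the reduction outcome, so it walks the whole region.
        return region_size
    # Every thread can read the outcome, so the region is split across the block.
    return -(-region_size // num_threads)  # ceiling division


print(epilogue_steps(4096, 256, all_threads=False))  # 4096
print(epilogue_steps(4096, 256, all_threads=True))   # 16
```

   Under these assumptions, a 4096-element epilogue on a 256-thread block takes 
4096 steps on thread 0 alone versus 16 steps per thread when the whole block 
participates.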
   
   In this case, we need to broadcast the cross-thread reduction outcome 
through shared memory, making it visible to the other threads, and then bind the 
compute region to all threads in the thread block.
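
   The data flow described above can be sketched as a pure-Python simulation of 
a single thread block. This is only an illustration under stated assumptions, 
not the schedule this PR generates: the function `block_mean_normalize`, the 
one-element `shared` list standing in for shared memory, and the epilogue that 
divides each element by the mean are all hypothetical.

```python
def block_mean_normalize(values, num_threads):
    """Simulate one thread block that divides `values` by their mean."""
    n = len(values)

    # Phase 1: each simulated thread accumulates a strided partial sum
    # in its own "register".
    partial = [sum(values[t::num_threads]) for t in range(num_threads)]

    # Phase 2: cross-thread reduction. In this model the outcome is
    # treated as valid on thread 0 only.
    total_on_thread0 = sum(partial)

    # Phase 3: thread 0 publishes the outcome to "shared memory".
    # A real GPU kernel would need a block-wide barrier after this write.
    shared = [total_on_thread0 / n]

    # Phase 4: the epilogue is bound to all threads; thread t handles
    # elements t, t + num_threads, t + 2 * num_threads, ...
    out = [0.0] * n
    for t in range(num_threads):
        for i in range(t, n, num_threads):
            out[i] = values[i] / shared[0]
    return out


result = block_mean_normalize([1.0, 2.0, 3.0, 6.0], num_threads=2)
print(result)
```

   With the input above the mean is 3.0, so the last two outputs are 1.0 and 
2.0. Each of the two simulated threads writes half of the output instead of 
thread 0 writing all of it.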

