junrushao opened a new pull request, #15252: URL: https://github.com/apache/tvm/pull/15252
This PR improves Decode-GEMV scheduling by further analyzing the epilogue pattern. The existing behavior assumes that the outcome of the cross-thread reduction stays in the register file local to a thread and is then used to compute the epilogue in that same thread. In practice, this means the reduction outcome is stored only on thread 0, and the other threads cannot participate in the subsequent computation (i.e. the epilogue). Related: https://github.com/apache/tvm/pull/15192.

When the epilogue is relatively lightweight, e.g. an elementwise add or a cast on scalars, this strategy is optimal. However, once the outcome needs to be broadcast over a non-trivial compute region, for example when it acts as the normalizer in `np.mean`, the kernel becomes much slower because only one thread in the thread block does useful work. In this case, we need to write the cross-thread reduction outcome to shared memory so that it is visible to the other threads, and then bind the compute region to all threads in the thread block.
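To make the difference between the two epilogue strategies concrete, here is a minimal sketch in plain Python that models a single thread block. It is not TVM or GPU code and does not reflect the actual schedule in this PR: the function names (`block_reduce`, `epilogue_on_thread0`, `epilogue_broadcast`), the strided work split, and the one-element list standing in for shared memory are all assumptions made for illustration. The sketch only counts how many epilogue elements each simulated thread processes.

```python
def block_reduce(values, num_threads):
    """Model a cross-thread sum: each thread sums a strided slice,
    then the partial sums are combined into one result."""
    partials = [sum(values[t::num_threads]) for t in range(num_threads)]
    return sum(partials)


def epilogue_on_thread0(values, num_threads):
    """Hypothetical model of the existing strategy: the reduction result
    lives only on thread 0, so thread 0 computes the whole epilogue."""
    total = block_reduce(values, num_threads)  # visible to thread 0 only
    work = [0] * num_threads                   # elements handled per thread
    out = [0.0] * len(values)
    for i in range(len(values)):
        out[i] = values[i] / total
        work[0] += 1
    return out, work


def epilogue_broadcast(values, num_threads):
    """Hypothetical model of the new strategy: the result is written to
    shared memory (a one-element list here) and every thread takes a
    strided slice of the epilogue region."""
    shared = [block_reduce(values, num_threads)]  # stand-in for shared memory
    work = [0] * num_threads
    out = [0.0] * len(values)
    for t in range(num_threads):
        for i in range(t, len(values), num_threads):
            out[i] = values[i] / shared[0]
            work[t] += 1
    return out, work


values = [float(v) for v in range(1, 9)]  # 8 elements, 4 simulated threads
out_a, work_a = epilogue_on_thread0(values, 4)
out_b, work_b = epilogue_broadcast(values, 4)
print(work_a, work_b)
```

Both functions produce the same output; the difference is only in how the epilogue work is distributed. In the first model thread 0 handles all 8 elements, while in the second each of the 4 threads handles 2.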
