MasterJH5574 opened a new pull request, #15374:
URL: https://github.com/apache/tvm/pull/15374

   This PR fixes the predicate handling logic of the cross-thread reduction 
lowering pass.
   
   For the cross-thread reduction write-back block, prior to this PR, its 
predicate was the conjunction of `t == 0` for each reduction thread dim of the 
cross-thread reduction. This is problematic when the write-back buffer is 
stored in local memory: each thread is supposed to hold a copy of the 
final value, but the final value is only stored by the first thread. In this 
PR, the predicate is changed to the conjunction of the clauses from two 
parts:
   
   * each clause of the original reduction block's predicate that contains a 
spatial loop var,
   * `t == 0` for each reduction thread dim **only when the write-back buffer 
is global or shared**.
   
   The first part ensures that the write-back does not go out of bounds. The 
second part ensures that when the write-back buffer is local, every thread 
gets a value, and when the write-back buffer is non-local, only one thread 
writes the value out.
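   
   As a rough illustration of the rule above, the following plain-Python 
sketch builds the list of clauses for the write-back predicate. It does not 
use TVM, and it is not the code of the lowering pass: the function names, the 
string form of the clauses, and the exact scope strings are assumptions made 
only for this example.

```python
# Hypothetical model of the new write-back predicate; not the TVM implementation.
# Clauses are plain strings, and the scope names are assumed for illustration.
NON_LOCAL_SCOPES = ("global", "shared")


def write_back_clauses(spatial_clauses, reduction_thread_vars, scope):
    """Return the clauses whose conjunction guards the write-back block."""
    # Part 1: keep the clauses that contain a spatial loop var (bounds check).
    clauses = list(spatial_clauses)
    # Part 2: add `t == 0` only when the write-back buffer is global or shared.
    if scope in NON_LOCAL_SCOPES:
        clauses.extend(f"{t} == 0" for t in reduction_thread_vars)
    return clauses


def threads_that_write(num_threads, scope):
    """Indices along one reduction thread dim that execute the write-back."""
    guarded = scope in NON_LOCAL_SCOPES
    return [t for t in range(num_threads) if not guarded or t == 0]


local_clauses = write_back_clauses(["i < 100"], ["threadIdx.x"], "local")
shared_clauses = write_back_clauses(["i < 100"], ["threadIdx.x"], "shared")
print(" and ".join(local_clauses))   # i < 100
print(" and ".join(shared_clauses))  # i < 100 and threadIdx.x == 0
print(threads_that_write(4, "local"))   # [0, 1, 2, 3]
print(threads_that_write(4, "shared"))  # [0]
```

   With a local write-back buffer, all four threads in the example store the 
value, so each one holds its own copy. With a shared buffer, only thread 0 
stores it.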
   
   Meanwhile, this PR makes the cross-thread broadcasting detection aware of 
the storage scope of the broadcasting block's write buffer. Specifically, 
consider each consumer block of a buffer produced by cross-thread reduction 
that is under the same kernel (i.e., the same set of `blockIdx`) as the 
cross-thread reduction block. When the write buffer of this consumer block is 
in local memory, we do not treat it as broadcasting and do not add a 
predicate to it. Otherwise, we add the predicate according to the 
broadcasting handling introduced by #15192.
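
   The broadcasting decision described above can be sketched in the same 
plain-Python style. The `Consumer` type, its fields, and the scope strings are 
hypothetical and exist only for this example. Every consumer in the list is 
assumed to already be under the same kernel as the cross-thread reduction 
block.

```python
from dataclasses import dataclass


@dataclass
class Consumer:
    """Hypothetical stand-in for a consumer block of a reduction result."""
    name: str
    write_scope: str  # storage scope of the consumer's write buffer (assumed strings)


def treated_as_broadcasting(consumer):
    """A consumer writing to local memory is not treated as broadcasting."""
    return consumer.write_scope != "local"


def consumers_needing_predicate(consumers):
    """Names of consumers that would receive the broadcasting predicate."""
    return [c.name for c in consumers if treated_as_broadcasting(c)]


consumers = [
    Consumer("normalize_local", "local"),
    Consumer("store_shared", "shared"),
    Consumer("store_global", "global"),
]
print(consumers_needing_predicate(consumers))  # ['store_shared', 'store_global']
```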

