mitiskuma commented on PR #18871:
URL: https://github.com/apache/tvm/pull/18871#issuecomment-4000493580

   @tqchen OK, after more benchmarking we lose about 50% without caching. The 
main issue is that with lazy submission, each dispatch in a batch needs its own 
uniform buffer and bind group. We can't reuse a single shared one because 
queue.writeBuffer executes immediately while the compute passes are deferred in 
the encoder, so a later write would clobber the parameters of every earlier 
dispatch in the batch.
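   To make the ordering hazard concrete, here is a minimal sketch (mock classes, not the real WebGPU API or the TVM runtime) modeling the semantics: writes take effect immediately, while recorded dispatches only read the buffer at submit time.

```typescript
// Mock of WebGPU's ordering semantics: queue.writeBuffer is immediate,
// but dispatches recorded into an encoder read buffers only at submit().
class MockBuffer {
  data = 0;
}

class MockQueue {
  // Immediate: the buffer contents change right now.
  writeBuffer(buf: MockBuffer, value: number): void {
    buf.data = value;
  }
}

class MockEncoder {
  private passes: Array<() => number> = [];
  // Deferred: the dispatch samples the buffer only when the batch is submitted.
  recordDispatch(params: MockBuffer): void {
    this.passes.push(() => params.data);
  }
  submit(): number[] {
    return this.passes.map((run) => run());
  }
}

// One shared uniform buffer for two dispatches in the same batch:
const queue = new MockQueue();
const encoder = new MockEncoder();
const shared = new MockBuffer();

queue.writeBuffer(shared, 1); // params intended for dispatch 0
encoder.recordDispatch(shared);
queue.writeBuffer(shared, 2); // params for dispatch 1 overwrite dispatch 0's
encoder.recordDispatch(shared);

console.log(encoder.submit()); // [2, 2] -- both dispatches see the last write
```

This is why each deferred dispatch needs its own uniform buffer: with a shared buffer, every dispatch in the batch observes only the final write.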
   We recover some of the loss (~60-70%) by pooling uniform buffers per 
dispatch index and reusing them across flushes. But createBindGroup() per 
dispatch is still expensive, and that's where the remaining gap comes from.
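   The pooling idea is roughly this (illustrative sketch, not the actual TVM runtime API): dispatch i of every flushed batch reuses slot i of a persistent pool, so buffer allocation only happens when a batch grows past the previous high-water mark.

```typescript
// Placeholder for a GPU uniform buffer; in the real runtime this would wrap
// a GPUBuffer created with GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST.
class UniformBuffer {}

// Per-dispatch-index pool: slot i is reused by dispatch i of every flush.
class UniformBufferPool {
  private pool: UniformBuffer[] = [];
  allocCount = 0; // tracks how many real allocations happened

  acquire(dispatchIndex: number): UniformBuffer {
    if (dispatchIndex >= this.pool.length) {
      this.pool.push(new UniformBuffer()); // grow only past the high-water mark
      this.allocCount++;
    }
    return this.pool[dispatchIndex];
  }
}

const pool = new UniformBufferPool();
// Two flushes of three dispatches each: only the first flush allocates.
for (let flush = 0; flush < 2; flush++) {
  for (let i = 0; i < 3; i++) pool.acquire(i);
}
console.log(pool.allocCount); // 3
```

This removes per-dispatch buffer allocation, but each dispatch still needs a bind group pointing at its slot, which is where the createBindGroup() cost remains.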
   So the caching isn't just a nice-to-have optimization; it's closely tied to 
making lazy submission performant imo.
   The uniform buffer cache avoids per-dispatch allocation, and the bind group 
cache avoids redundant createBindGroup() calls for repeated kernel signatures 
(which, as you know, is common in transformer inference, where the same kernels 
run on every token). Would it work to keep the caching in this PR, since lazy 
submission needs it to avoid a performance regression? Or would you prefer we 
land lazy submission with the pool-based workaround, accepting the ~35% 
regression vs full caching?
   Or we could just abstract the caching behind a better structure.
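   For reference, the bind group cache amounts to something like the following sketch (the key scheme and names are illustrative, not the actual implementation): a map keyed by kernel signature, where a stub stands in for device.createBindGroup().

```typescript
// Sketch of a bind-group cache keyed by kernel signature. The key here
// (kernel name + buffer ids) is a hypothetical scheme for illustration.
class BindGroupCache {
  private cache = new Map<string, object>();
  createCalls = 0; // counts stand-ins for device.createBindGroup(...)

  getOrCreate(kernelName: string, bufferIds: number[]): object {
    const key = `${kernelName}:${bufferIds.join(",")}`;
    let bindGroup = this.cache.get(key);
    if (bindGroup === undefined) {
      this.createCalls++; // only pay createBindGroup on a cache miss
      bindGroup = { key };
      this.cache.set(key, bindGroup);
    }
    return bindGroup;
  }
}

const cache = new BindGroupCache();
// Simulate 4 decode steps, each running the same two kernels on the same buffers:
for (let token = 0; token < 4; token++) {
  cache.getOrCreate("matmul", [1, 2, 3]);
  cache.getOrCreate("softmax", [3, 4]);
}
console.log(cache.createCalls); // 2 -- one per unique signature, not per dispatch
```

A real implementation would also need invalidation when a cached buffer is freed or reallocated; that is the kind of concern a cleaner abstraction around the caching could isolate.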


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
