mitiskuma commented on PR #18871:
URL: https://github.com/apache/tvm/pull/18871#issuecomment-3999973080

   @tqchen thanks for the review.
   Here are the benchmark results comparing Apache TVM (main) against this branch. Both runs use the same web-llm-baseline code and the same model; only the TVM web runtime differs:
   - Throughput: 48 t/s (baseline) vs 143 t/s (optimized), a ~3x speedup
   - TTFT dropped from 181ms to 110-130ms
   - Output is identical and deterministic (temperature=0); also tested with different temperatures, with similar percentage improvements.
   
   Regarding keeping the runtime simple: the caching is fully contained within 
the WebGPU layer and doesn't affect the API surface. The caches are bounded 
(FIFO eviction at 512 uniform buffers and 256 bind groups), so there's no 
unbounded memory growth.
   On eager vs. lazy submission: you're right that eager submission allows host compute to overlap with GPU compute. However, during LLM decode the CPU work between dispatches is mostly FFI/binding overhead, with very little meaningful host computation to overlap. The 3x speedup suggests that the per-dispatch JS-to-GPU transition cost is the dominant bottleneck here, and batching those submissions is a net win. That said, I'm open to making the flush strategy configurable if that would be preferred.
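   As a sketch of what a configurable flush strategy could look like: dispatches are recorded and submitted in batches, and a batch size of 1 recovers eager submission. The interface and names below are illustrative assumptions, not the actual TVM runtime API; in practice the queue would be a `GPUQueue` and the commands `GPUCommandBuffer`s.

   ```typescript
   // Hypothetical sketch: batched submission with a tunable flush threshold.
   // A minimal queue interface stands in for WebGPU's GPUQueue.
   interface SubmittableQueue<C> {
     submit(buffers: C[]): void;
   }

   class BatchedSubmitter<C> {
     private pending: C[] = [];
     constructor(
       private queue: SubmittableQueue<C>,
       // batchSize = 1 gives eager submission; larger values amortize
       // the per-submit JS-to-GPU transition cost across dispatches.
       private batchSize: number,
     ) {}

     enqueue(commands: C): void {
       this.pending.push(commands);
       if (this.pending.length >= this.batchSize) {
         this.flush();
       }
     }

     // Must also be called at synchronization points (e.g. before a
     // readback) so no recorded work is left unsubmitted.
     flush(): void {
       if (this.pending.length > 0) {
         this.queue.submit(this.pending);
         this.pending = [];
       }
     }
   }
   ```

   Exposing `batchSize` as a runtime option would let users pick eager submission when host/GPU overlap matters more than per-submit overhead.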


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

