mitiskuma opened a new pull request, #18877:
URL: https://github.com/apache/tvm/pull/18877
Benchmark results (Metal, M4 Max, MLC-LLM serve, temperature=0):
256 decode tokens:
Qwen2.5-0.5B-Instruct-q4f16_1: 238 t/s -> 466 t/s (1.95x)
Qwen2.5-1.5B-Instruct-q4f16_1: 177 t/s -> 239 t/s (1.35x)
Qwen2.5-3B-Instruct-q4f16_1: 114 t/s -> 139 t/s (1.21x)
Llama-3.1-8B-Instruct-q4f16_1: 76 t/s -> 89 t/s (1.18x)
1024 decode tokens:
Qwen2.5-0.5B-Instruct-q4f16_1: 239 t/s -> 398 t/s (1.67x)
Qwen2.5-1.5B-Instruct-q4f16_1: 137 t/s -> 190 t/s (1.38x)
Qwen2.5-3B-Instruct-q4f16_1: 92 t/s -> 115 t/s (1.25x)
Llama-3.1-8B-Instruct-q4f16_1: 70 t/s -> 80 t/s (1.14x)
Baseline and optimized runs use the same MLC-LLM version and the same compiled
models; only the TVM Metal runtime differs. Servers run sequentially (not in
parallel) to avoid GPU contention, and each run is preceded by 2 warmup requests.
The speedup is larger on smaller models because they are dispatch-bound
(262 dispatches/token for 0.5B vs 394 for 8B). Larger models spend more
time in actual compute, so the per-dispatch overhead is a smaller fraction.
At 1024 tokens the 0.5B speedup drops from 1.95x to 1.67x because KV cache
growth increases per-token compute, shifting the bottleneck toward memory
bandwidth.
What changed:
1. Batched compute dispatch. Kernel dispatches are accumulated in a single
MTLCommandBuffer via a shared MTLComputeCommandEncoder. Previously, each
dispatch created its own command buffer and committed it immediately. The
pending encoder is flushed on GPU->CPU readback, buffer deallocation,
or stream sync.
2. Inline blit encoders for copies. CPU->GPU and GPU->GPU copies now use
blit encoders on the same pending command buffer instead of creating a
separate command buffer per copy. Metal guarantees sequential ordering
of encoders within a command buffer, so no explicit sync is needed
between compute and copy operations.
3. Staging buffer pool for CPU->GPU copies. Each inlined CPU->GPU copy
needs its own staging buffer because the GPU reads them asynchronously
from the deferred command buffer. A per-device StagingBufferPool hands
out shared-mode buffers and recycles them after flush/sync.
4. Conditional sync in FreeDataSpace. Instead of always calling StreamSync,
we check HasPendingWork() first. When the GPU->CPU readback path has
already flushed and waited, FreeDataSpace can skip the redundant sync.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]