csullivan opened a new pull request, #16842:
URL: https://github.com/apache/tvm/pull/16842
* Use CCL type traits to share common code between NCCL and MSCCLPP API
invocations in disco.
* Add a benchmark to validate results and compare the various supported CCL
approaches for CUDA.
Aggregated profiling results over the sweep of transfer sizes introduced in
the benchmark mentioned above, 2**12 through 2**24:
```
Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
33.2      277809887        4048       68628.9   45504.0   15104     25069360  545729.8     ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
24.3      203015590        4040       50251.4   48256.0   12992     315934    29470.2      void tensorrt_llm::twoShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)
20.9      174549284        4040       43205.3   39440.0   3711      1085846   52387.0      void tensorrt_llm::oneShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)
20.7      173275472        4040       42890.0   40112.0   5375      750653    42864.2      void tvm::runtime::allreduce_simple<__half>(mscclpp::SmChannelDeviceHandle *, const T1 *, T1 *, voi…
0.8       6985121          120        58209.3   55871.5   9695      158239    35854.8      ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
```
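For reference, the sweep of transfer sizes above (2**12 through 2**24) can be enumerated as in the sketch below. Whether these counts are bytes or elements depends on the benchmark itself; the function name `TransferSizes` is invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// Enumerate the power-of-two sweep 2^12 .. 2^24 used for the transfer sizes.
std::vector<size_t> TransferSizes() {
  std::vector<size_t> sizes;
  for (int p = 12; p <= 24; ++p) {
    sizes.push_back(size_t{1} << p);  // 4096, 8192, ..., 16777216
  }
  return sizes;
}
```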
I noted significant variance between runs, so end-to-end measurement, or use
of CUDA graph launch for synchronization, could help give a clearer picture.