LeiWang1999 commented on PR #14329:
URL: https://github.com/apache/tvm/pull/14329#issuecomment-1478927917
## Microbenchmark
- Test device: A100 (async copy works best on devices with high memory bandwidth, such as the A100 or H100).
- CUDA version: 12.0
- Tested diffusion Conv2d shapes.
- Tuner: none (scheduled by hand to isolate the performance impact; the schedule is not optimal).
- Diffusion Conv2d benchmark of vectorized `if_then_else` async copy (nhwc_nhwc layout, fp16 precision, Tensor Cores enabled):
| | N | C | H | W | CO | K | S | D | P |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| C8 | 2 | 640 | 64 | 64 | 640 | 3 | 1 | 1 | 1 |
| C11 | 2 | 960 | 32 | 32 | 640 | 3 | 1 | 1 | 1 |
| C13 | 2 | 1280 | 32 | 32 | 1280 | 3 | 1 | 1 | 1 |
Performance (the weight load is vectorized and asynchronous in both cases):
| | without vectorized if_then_else async (ms) | with vectorized if_then_else async (ms) |
| ---- | ------------------------------------------ | --------------------------------------- |
| C8 | 0.779605 | 0.543061 |
| C11 | 0.544085 | 0.356011 |
| C13 | 0.267264 | 0.218794 |
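To make the mechanism concrete, here is a hedged sketch (not the PR's generated code; names, tile sizes, and the padding logic are illustrative) of why vectorizing the `if_then_else` helps: the boundary predicate is hoisted from per-element to per-vector, so an in-bounds 8×fp16 chunk can be issued as a single 16-byte `cp.async` instead of eight predicated scalar loads.

```cuda
#include <cuda_fp16.h>

// Scalar if_then_else: one predicate and one element per copy, so the
// load can never be issued as a single 16-byte async transaction.
__device__ void load_scalar(half* smem, const half* gmem,
                            int w, int W /* row width */, int pad) {
  for (int v = 0; v < 8; ++v) {
    int x = w + v - pad;
    smem[v] = (0 <= x && x < W) ? gmem[x] : __float2half(0.f);
  }
}

// Vectorized if_then_else: the predicate is evaluated once for the whole
// 8-element vector. In-bounds chunks (the common case) become one 16-byte
// cp.async; only boundary chunks fall back to the scalar path.
__device__ void load_vectorized(half* smem, const half* gmem,
                                int w, int W, int pad) {
  int x0 = w - pad;
  if (0 <= x0 && x0 + 8 <= W) {
    unsigned dst = (unsigned)__cvta_generic_to_shared(smem);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmem + x0));
  } else {
    load_scalar(smem, gmem, w, W, pad);  // boundary fallback, stays scalar
  }
}
```

For the diffusion Conv2d shapes above, padding only affects the outermost rows/columns, so almost every chunk takes the fast single-copy path.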
- GEMM benchmark to see the `:l2` cache intrinsic's influence (float16-float16, row-col, Tensor Cores):
| | M | N | K | without `:l2` (ms) | with `:l2` (ms) |
| ------ | ----- | ----- | ----- | ------------------ | --------------- |
| GEMM-0 | 256 | 256 | 256 | 0.020821 | 0.020821 |
| GEMM-1 | 16384 | 16384 | 16384 | 45.1103 | 45.1103 |
The `:l2` results were as expected: in my previous tests I did not see any real performance impact either. I leverage the L2 cache in a different way, and I will wait until that work is ready before submitting another pull request. Even though the `:l2` feature has no effect in these tests, I still think it is worth adding, because the kernels of SOTA libraries like CUTLASS/cuBLAS enable it.
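For reference, my understanding is that the `:l2` annotation corresponds to the `.L2::128B` prefetch qualifier on `cp.async` in the PTX ISA (the helper names below are hypothetical; only the PTX instructions are from the ISA spec). A hedged sketch of the two forms:

```cuda
// Both issue a 16-byte asynchronous global->shared copy; dst_smem /
// src_gmem are illustrative names, not from the PR.

// Without :l2 - plain cp.async.
__device__ void cp_async_16(void* dst_smem, const void* src_gmem) {
  unsigned dst = (unsigned)__cvta_generic_to_shared(dst_smem);
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(dst), "l"(src_gmem));
}

// With :l2 - same copy, but additionally asks the hardware to prefetch
// the surrounding 128-byte L2 sector of the source address.
__device__ void cp_async_16_l2(void* dst_smem, const void* src_gmem) {
  unsigned dst = (unsigned)__cvta_generic_to_shared(dst_smem);
  asm volatile("cp.async.cg.shared.global.L2::128B [%0], [%1], 16;\n"
               :: "r"(dst), "l"(src_gmem));
}
```

Since the prefetch is only a hint to the L2, a neutral result on bandwidth-bound GEMMs like the ones above is plausible; the hint mainly matters when later accesses reuse the prefetched sector.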

This may need more discussion.