comaniac opened a new pull request #8457:
URL: https://github.com/apache/tvm/pull/8457


   Per discussion in 
https://discuss.tvm.apache.org/t/cuda-enable-half2-in-cuda-injective-schedule/10441,
 this PR improves the CUDA injective schedule to benefit more from `half2` when 
working on `float16`.
   
   The background is that although the CUDA injective schedule already vectorizes 
the innermost loop when working on float16, the vectorization may fail due to 
the if-conditions introduced when the workload is not evenly divisible by the 
block/thread sizes. Formally, vectorization requires `prod(output_shape) % block 
% thread % vector_width == 0`. To make sure vectorization is effective, this PR 
adjusts the block and thread sizes accordingly (see the code change for details).
   
   On the other hand, when the output shape is awkward (e.g., involves prime 
numbers), the selected block and thread sizes may be too small. For example, if 
the output shape is `(311, 3814)`, then the factors are `(1, 2, 311, 1907, 
3814)`. As a result, we may select `(block, thread) = (2, 311)` under the 
maximum `(block, thread) = (256, 1024)`. In this case, the compute resources 
are poorly utilized even with `half2` enabled.
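   A rough sketch of the divisor enumeration behind this example (not the PR implementation; the exact candidate set the schedule considers may differ slightly):

   ```python
   import math

   def factors(n):
       # Enumerate all divisors of n by trial division up to sqrt(n).
       ds = set()
       for d in range(1, math.isqrt(n) + 1):
           if n % d == 0:
               ds.update((d, n // d))
       return sorted(ds)

   # For an output of shape (311, 3814) the small divisors are sparse,
   # so any (block, thread) pair drawn from them is far below the
   # (256, 1024) hardware maximum.
   small = factors(311 * 3814)[:6]
   print(small)  # [1, 2, 311, 622, 1907, 3814]
   ```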
   
   Ideally, we should pad the output so that the factors are always powers of 
two, but that is too complicated and may introduce other issues. Accordingly, 
this PR introduces another heuristic: when `(select_block * select_thread) / 
(max_block * max_thread) < R`, we don't apply the change and simply let the 
vectorization fail.
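   The utilization threshold can be sketched like this (illustrative only; `should_vectorize` and the default limits are assumptions, not the PR's actual code):

   ```python
   def should_vectorize(select_block, select_thread,
                        max_block=256, max_thread=1024, R=0.7):
       # Skip the block/thread adjustment when the resulting resource
       # utilization would fall below the threshold R; vectorization
       # then fails, but occupancy is preserved.
       util = (select_block * select_thread) / (max_block * max_thread)
       return util >= R
   ```

   With the `(311, 3814)` example above, `should_vectorize(2, 311)` is false, so the schedule keeps the original (non-vectorized) configuration.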
   
   Here are the evaluation results with `R=0.7`.
   * Workloads: FP32Mul_FP16Add, FP16Mul_FP16Add, FP16Mul, Cast, 
FP32Mul_FP32Add, FP32Mul.
   * Output shapes: I manually assigned two shapes, (768, 3072) and (1, 1000), 
and randomly generated 100 additional shapes with dimensions ranging from 1 to 4096.
   * Platform: NVIDIA T4 and V100.
   
   For each platform, I report the worst, best, and average speedup across all 
workloads over the current upstream.
   
   * T4: Worst 0.98x, Best 1.41x, Average 1.12x.
   * V100: Worst 0.97x, Best 1.33x, Average 1.15x.
   
   cc @vinx13 @wpan11nv @Laurawly @masahi 

