mbrookhart opened a new pull request #7611: URL: https://github.com/apache/tvm/pull/7611
Building on @masahi's work to add a while loop to TIR, this PR implements the [MergePath](https://arxiv.org/abs/1406.2628) algorithm to parallelize the later stages of mergesort. It also implements a stable in-shared-memory quicksort to do small block sorting before mergesort for better speed. All told, this optimization gives us dramatic speedups on both AMD and Nvidia GPUs, see below. cc @Laurawly @zhiics @icemelon9 @csullivan @tkonolige I had to add a path for skipping thread planning to get handed-threaded odd-even-tranpose sort to work, cc @tqchen @junrushao1994 ``` thrust_R9-Nano mergepath_R9-Nano thrust_1070Ti mergepath_1070Ti thrust_V100 mergepath_V100 [1000, 1, 1] 0.60 ms 0.15 ms 0.05 ms 0.05 ms 0.07 ms 0.04 ms [1, 1000, 1] 0.59 ms 0.14 ms 0.05 ms 0.05 ms 0.07 ms 0.04 ms [1, 1, 1000] 0.71 ms 0.13 ms 0.04 ms 0.05 ms 0.06 ms 0.03 ms [1000, 10, 10] 1.65 ms 0.25 ms 0.96 ms 0.27 ms 1.81 ms 0.08 ms [10, 1000, 10] 1.65 ms 0.25 ms 0.80 ms 0.26 ms 1.64 ms 0.08 ms [10, 10, 1000] 1.63 ms 0.23 ms 0.95 ms 0.25 ms 1.82 ms 0.07 ms [1000, 100, 100] 29.29 ms 18.59 ms 47.15 ms 33.82 ms 18.74 ms 7.43 ms [100, 1000, 100] 28.63 ms 16.35 ms 37.36 ms 23.57 ms 17.44 ms 5.28 ms [100, 100, 1000] 29.62 ms 15.13 ms 36.33 ms 21.11 ms 18.04 ms 4.64 ms [10000, 1, 1] 0.78 ms 0.25 ms 0.14 ms 0.10 ms 0.46 ms 0.08 ms [1, 10000, 1] 0.79 ms 0.25 ms 0.14 ms 0.10 ms 0.45 ms 0.08 ms [1, 1, 10000] 0.76 ms 0.24 ms 0.13 ms 0.09 ms 0.45 ms 0.08 ms [10000, 10, 10] 5.27 ms 2.68 ms 4.58 ms 3.36 ms 3.78 ms 0.78 ms [10, 10000, 10] 5.03 ms 2.56 ms 4.47 ms 3.33 ms 3.78 ms 0.73 ms [10, 10, 10000] 5.34 ms 2.48 ms 4.34 ms 3.08 ms 3.76 ms 0.70 ms [10000, 100, 100] Out of memory 338.65 ms 467.08 ms 437.17 ms 145.14 ms 96.20 ms [100, 10000, 100] Out of memory 284.24 ms 410.39 ms 380.18 ms 134.70 ms 84.63 ms [100, 100, 10000] Out of memory 250.24 ms 367.42 ms 305.95 ms 111.16 ms 59.69 ms [100000, 1, 1] 0.73 ms 0.51 ms 0.16 ms 0.40 ms 0.50 ms 0.14 ms [1, 100000, 1] 0.73 ms 0.51 ms 0.27 ms 0.40 ms 0.66 ms 0.14 ms [1, 1, 100000] 0.71 ms 0.50 ms 0.23 ms 0.39 ms 0.69 ms 0.13 ms [100000, 10, 10] 26.64 ms 33.12 ms 39.23 ms 39.64 ms 18.65 ms 8.48 ms [10, 100000, 10] 26.80 ms 30.91 ms 37.43 ms 37.88 ms 17.74 ms 7.67 ms [10, 10, 100000] 25.46 ms 29.75 ms 36.33 ms 35.59 ms 16.49 ms 7.17 ms [1000000, 1, 1] 1.91 ms 3.68 ms 1.18 ms 4.36 ms 1.00 ms 1.01 ms [1, 1000000, 1] 1.91 ms 3.68 ms 1.23 ms 4.32 ms 1.00 ms 1.01 ms [1, 1, 1000000] 1.91 ms 3.64 ms 1.15 ms 4.34 ms 1.01 ms 0.99 ms [1000000, 10, 10] Out of memory 449.20 ms 410.63 ms 499.69 ms 136.20 ms 115.64 ms [10, 1000000, 10] Out of memory 385.10 ms 366.33 ms 443.03 ms 116.31 ms 94.27 ms [10, 10, 1000000] Out of memory 370.27 ms 364.14 ms 419.95 ms 111.16 ms 87.82 ms [(4507,), 0] 0.70 ms 0.21 ms 1.98 ms 4.32 ms 0.46 ms 0.06 ms [(1, 122640), 1] 0.78 ms 0.53 ms 0.17 ms 0.45 ms 0.70 ms 0.14 ms [(1, 120000), 1] 0.73 ms 0.52 ms 0.28 ms 0.44 ms 0.69 ms 0.14 ms [(1, 30000), 1] 0.74 ms 0.28 ms 0.16 ms 0.14 ms 0.46 ms 0.09 ms [(1, 7500), 1] 0.73 ms 0.21 ms 0.13 ms 0.08 ms 0.46 ms 0.07 ms [(1, 1000), 1] 0.65 ms 0.13 ms 0.14 ms 0.07 ms 0.07 ms 0.04 ms ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
