mbrookhart opened a new pull request #7611:
URL: https://github.com/apache/tvm/pull/7611


   Building on @masahi's work to add a while loop to TIR, this PR implements 
the [MergePath](https://arxiv.org/abs/1406.2628) algorithm to parallelize the 
later stages of mergesort. It also implements a stable in-shared-memory 
quicksort to do small block sorting before mergesort for better speed. All 
told, this optimization gives us dramatic speedups on both AMD and Nvidia GPUs, 
see below. cc @Laurawly @zhiics @icemelon9 @csullivan @tkonolige 
   
   I had to add a path for skipping thread planning to get handed-threaded 
odd-even-tranpose sort to work, cc @tqchen @junrushao1994 
   
   ```
                      thrust_R9-Nano  mergepath_R9-Nano  thrust_1070Ti   
mergepath_1070Ti     thrust_V100  mergepath_V100
   [1000, 1, 1]              0.60 ms            0.15 ms        0.05 ms          
  0.05 ms         0.07 ms         0.04 ms
   [1, 1000, 1]              0.59 ms            0.14 ms        0.05 ms          
  0.05 ms         0.07 ms         0.04 ms
   [1, 1, 1000]              0.71 ms            0.13 ms        0.04 ms          
  0.05 ms         0.06 ms         0.03 ms
   [1000, 10, 10]            1.65 ms            0.25 ms        0.96 ms          
  0.27 ms         1.81 ms         0.08 ms
   [10, 1000, 10]            1.65 ms            0.25 ms        0.80 ms          
  0.26 ms         1.64 ms         0.08 ms
   [10, 10, 1000]            1.63 ms            0.23 ms        0.95 ms          
  0.25 ms         1.82 ms         0.07 ms
   [1000, 100, 100]         29.29 ms           18.59 ms       47.15 ms          
 33.82 ms        18.74 ms         7.43 ms
   [100, 1000, 100]         28.63 ms           16.35 ms       37.36 ms          
 23.57 ms        17.44 ms         5.28 ms
   [100, 100, 1000]         29.62 ms           15.13 ms       36.33 ms          
 21.11 ms        18.04 ms         4.64 ms
   [10000, 1, 1]             0.78 ms            0.25 ms        0.14 ms          
  0.10 ms         0.46 ms         0.08 ms
   [1, 10000, 1]             0.79 ms            0.25 ms        0.14 ms          
  0.10 ms         0.45 ms         0.08 ms
   [1, 1, 10000]             0.76 ms            0.24 ms        0.13 ms          
  0.09 ms         0.45 ms         0.08 ms
   [10000, 10, 10]           5.27 ms            2.68 ms        4.58 ms          
  3.36 ms         3.78 ms         0.78 ms
   [10, 10000, 10]           5.03 ms            2.56 ms        4.47 ms          
  3.33 ms         3.78 ms         0.73 ms
   [10, 10, 10000]           5.34 ms            2.48 ms        4.34 ms          
  3.08 ms         3.76 ms         0.70 ms
   [10000, 100, 100]   Out of memory          338.65 ms      467.08 ms          
437.17 ms       145.14 ms        96.20 ms
   [100, 10000, 100]   Out of memory          284.24 ms      410.39 ms          
380.18 ms       134.70 ms        84.63 ms
   [100, 100, 10000]   Out of memory          250.24 ms      367.42 ms          
305.95 ms       111.16 ms        59.69 ms
   [100000, 1, 1]            0.73 ms            0.51 ms        0.16 ms          
  0.40 ms         0.50 ms         0.14 ms
   [1, 100000, 1]            0.73 ms            0.51 ms        0.27 ms          
  0.40 ms         0.66 ms         0.14 ms
   [1, 1, 100000]            0.71 ms            0.50 ms        0.23 ms          
  0.39 ms         0.69 ms         0.13 ms
   [100000, 10, 10]         26.64 ms           33.12 ms       39.23 ms          
 39.64 ms        18.65 ms         8.48 ms
   [10, 100000, 10]         26.80 ms           30.91 ms       37.43 ms          
 37.88 ms        17.74 ms         7.67 ms
   [10, 10, 100000]         25.46 ms           29.75 ms       36.33 ms          
 35.59 ms        16.49 ms         7.17 ms
   [1000000, 1, 1]           1.91 ms            3.68 ms        1.18 ms          
  4.36 ms         1.00 ms         1.01 ms
   [1, 1000000, 1]           1.91 ms            3.68 ms        1.23 ms          
  4.32 ms         1.00 ms         1.01 ms
   [1, 1, 1000000]           1.91 ms            3.64 ms        1.15 ms          
  4.34 ms         1.01 ms         0.99 ms
   [1000000, 10, 10]   Out of memory          449.20 ms      410.63 ms          
499.69 ms       136.20 ms       115.64 ms
   [10, 1000000, 10]   Out of memory          385.10 ms      366.33 ms          
443.03 ms       116.31 ms        94.27 ms
   [10, 10, 1000000]   Out of memory          370.27 ms      364.14 ms          
419.95 ms      111.16 ms         87.82 ms
   [(4507,), 0]              0.70 ms            0.21 ms        1.98 ms          
  4.32 ms         0.46 ms         0.06 ms
   [(1, 122640), 1]          0.78 ms            0.53 ms        0.17 ms          
  0.45 ms         0.70 ms         0.14 ms
   [(1, 120000), 1]          0.73 ms            0.52 ms        0.28 ms          
  0.44 ms         0.69 ms         0.14 ms
   [(1, 30000), 1]           0.74 ms            0.28 ms        0.16 ms          
  0.14 ms         0.46 ms         0.09 ms
   [(1, 7500), 1]            0.73 ms            0.21 ms        0.13 ms          
  0.08 ms         0.46 ms         0.07 ms
   [(1, 1000), 1]            0.65 ms            0.13 ms        0.14 ms          
  0.07 ms         0.07 ms         0.04 ms
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to