AndrewZhaoLuo opened a new pull request, #13599: URL: https://github.com/apache/tvm/pull/13599
This is a simple rewrite of hand-coded top-k function used for CPU targets. The old implementation sorted each axis and then took the biggest k elements. The new implementation does a single pass of each axis, keeping a min heap to store the top-k elements up to that point. If n is the size of the array, and we want to find top k, the old implementation has runtime in O(nlogn) with additional memory O(n) to store the sorted array. The new implementation is O(n log k), and in practice is probably amortized to O(n / k * log k) in many scenarios and only requires O(k). Note n >> k. In practice this new kernel led to a 20x speedup over existing one. On a Xeon Platinum 8370C CPU @ 2.80GHz for input shape [1, 3050] with k = 15, the latency went from 300us --> ~10us. There is probably more room for shaving off a little more time on the scale of a single us's, however I have determined it to not be worth it. This change however is probably in the range of worth comitting. TODO: Wider benchmark suite over many shapes + k + hardware platforms. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
