Hi all,

Would it make sense to add a "top()" method to the Dataset API? This method would return a Dataset containing the top k elements, so the caller could either do further processing on that Dataset or call collect(). This is in contrast with RDD's top(), which returns a collected array.
In terms of implementation, this would use a bounded priority queue, which avoids sorting all elements and runs in O(n log k). I know something similar can be achieved with "orderBy().take()", but I am not sure whether that is optimized. If it is not, and it sorts all elements (and therefore runs in O(n log n)), it might be handy to add this method. What do you think?

Regards,
Yacine.
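
For illustration, the bounded-priority-queue idea mentioned above could be sketched roughly as follows. This is plain Python rather than Spark code, and the names top_k and merge_top_k are hypothetical helpers, not part of any Spark API. It assumes each partition computes its own top k, and that the partial results are then merged the same way.

```python
import heapq


def top_k(items, k):
    """Return the k largest items in descending order.

    Keeps a min-heap of at most k elements, so the cost is
    O(n log k) instead of the O(n log n) of a full sort.
    """
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the current top k
    for item in items:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Evict the smallest retained element in O(log k).
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


def merge_top_k(partials, k):
    """Merge per-partition top-k lists into a global top k (hypothetical helper)."""
    return top_k((x for partial in partials for x in partial), k)


# Simulated partitions (assumption: data is already split this way).
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]
per_partition = [top_k(p, 2) for p in partitions]
result = merge_top_k(per_partition, 2)
```

Only k elements per partition need to be kept in memory or moved between nodes, which is the main appeal over sorting everything first.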