Hi All,

Would it make sense to add a "top()" method to the Dataset API?
This method would return a Dataset containing the top k elements. The caller
could then do further processing on the Dataset or call collect(). This is in
contrast with RDD's top(), which returns a collected array.

In terms of implementation, this would use a bounded priority queue, which
avoids sorting all elements and runs in O(n log k).
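
To illustrate the idea, here is a minimal sketch of a bounded priority queue
in plain Python. It is not Spark code. The names top_k and merge_top_k are
hypothetical, and the sketch assumes each partition keeps its own top k
before the partial results are merged.

```python
import heapq
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def top_k(items: Iterable[T], k: int) -> List[T]:
    """Return the k largest items in descending order.

    A min-heap holds at most k elements, so each of the n items costs
    O(log k) at most, giving O(n log k) overall.
    """
    if k <= 0:
        return []
    heap: List[T] = []
    for item in items:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new item beats the smallest kept item, so swap it in.
            heapq.heapreplace(heap, item)
    # Only k elements are sorted here, which costs O(k log k).
    return sorted(heap, reverse=True)


def merge_top_k(partials: Iterable[Iterable[T]], k: int) -> List[T]:
    """Merge per-partition results (hypothetical helper) into a global top k."""
    return top_k((x for part in partials for x in part), k)
```

For example, top_k([5, 1, 9, 3, 7, 2], 3) keeps only three elements in memory
at any time, and merge_top_k combines the per-partition results the same way.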

I know something similar can be achieved with "orderBy().take()", but I am not
sure whether this is optimized.
If it is not, and it sorts all elements (therefore running in O(n log n)), it
might be handy to add this method.

What do you think?

Regards,
Yacine.
