[SQL] [Suggestion] Add top() to Dataset

Yacine Mazari Tue, 30 Jan 2018 23:11:28 -0800

Hi All,

Would it make sense to add a "top()" method to the Dataset API?
This method would return a Dataset containing the top k elements; the caller
could then do further processing on the Dataset or call collect(). This is in
contrast with RDD's top(), which returns a collected array.


In terms of implementation, this would use a bounded priority queue, which
avoids sorting all elements and runs in O(n log k).
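
To illustrate the idea, here is a minimal sketch in plain Python rather than
Spark code. The names top_k and merge_top_k are hypothetical and are not part
of any Spark API. It assumes each partition keeps a min-heap of at most k
elements and that the per-partition results are then merged the same way.

```python
import heapq
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def top_k(items: Iterable[T], k: int) -> List[T]:
    """Return the k largest items, largest first, using a bounded min-heap."""
    if k <= 0:
        return []
    heap: List[T] = []  # min-heap holding at most k elements
    for item in items:
        if len(heap) < k:
            heapq.heappush(heap, item)       # O(log k)
        elif item > heap[0]:
            # New item beats the smallest kept element: swap it in, O(log k).
            heapq.heapreplace(heap, item)
    # Only the k survivors are sorted, which costs O(k log k).
    return sorted(heap, reverse=True)


def merge_top_k(partials: Iterable[Iterable[T]], k: int) -> List[T]:
    """Combine per-partition top-k lists into a global top-k (hypothetical helper)."""
    return top_k((x for part in partials for x in part), k)
```

For example, top_k([5, 1, 9, 3, 7, 2], 3) keeps at most three elements in
memory at any time, so the full input is never sorted.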

I know something similar can be achieved with "orderBy().take()", but I am not
sure whether this is optimized.
If it is not, and it sorts all elements (therefore running in O(n log n)),
it might be handy to add this method.

What do you think?

Regards,
Yacine.




