For the DataFrame/Dataset API, the optimizer actually already rewrites an orderBy followed by a take into a top-k implementation based on a priority queue.
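To illustrate the idea under discussion, here is a minimal single-machine sketch of top-k selection with a bounded priority queue. It is not Spark's actual code: the function name `top_k` is hypothetical, and it uses Python's standard `heapq` module only to show why the cost is O(n log k) rather than the O(n log n) of a full sort.

```python
import heapq
from typing import Iterable, List


def top_k(items: Iterable[int], k: int) -> List[int]:
    """Return the k largest items in descending order.

    Hypothetical helper for illustration only. It keeps a min-heap of at
    most k elements, so each of the n inputs costs at most O(log k).
    """
    if k <= 0:
        return []

    heap: List[int] = []  # min-heap: heap[0] is the smallest of the current top k
    for x in items:
        if len(heap) < k:
            # The queue is not full yet, so every element is kept.
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest retained element: evict it and insert x.
            heapq.heapreplace(heap, x)
        # Otherwise x cannot be in the top k and is dropped immediately.

    # Only the k survivors are sorted, which costs O(k log k).
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    print(top_k([5, 1, 9, 3, 7, 2], 3))
```

In a distributed setting one would presumably build such a bounded queue per partition and then merge the partial results, but that part is an assumption about the general technique and is not shown here.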
On Tue, Jan 30, 2018 at 11:10 PM, Yacine Mazari <y.maz...@gmail.com> wrote:

> Hi All,
>
> Would it make sense to add a "top()" method to the Dataset API?
> This method would return a Dataset containing the top k elements, the caller
> may then do further processing on the Dataset or call collect(). This is in
> contrast with RDD's top() which returns a collected array.
>
> In terms of implementation, this would use a bounded priority queue, which
> will avoid sorting all elements and run in O(n log k).
>
> I know something similar can be achieved by "orderBy().take()", but I am not
> sure if this is optimized.
> If that's not the case, and it performs sorting of all elements (therefore
> running in n log n), it might be handy to add this method.
>
> What do you think?
>
> Regards,
> Yacine.