Hello!

LW> Worth noting: Guava uses a similar implementation for
LW> Ordering.leastOf, but instead of sorting the array when it's
LW> filled, does a quickselect pass to do it in O(k) time instead of
LW> O(k log k).

Thank you for mentioning Guava. Unfortunately quickselect is not
stable, while sorted().limit(n) must produce a stable result.
Quickselect might be good for the primitive specializations, though.
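To illustrate the stability requirement (a toy example, not code from
the patch):

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class StableLimitDemo {
        public static void main(String[] args) {
            // "bb" and "cc" compare equal by length; sorted() is
            // specified to be stable for ordered streams, so the two
            // least elements must keep encounter order: [a, bb].
            // A quickselect-based selection such as Guava's leastOf
            // may break such ties arbitrarily and return [a, cc].
            List<String> input = Arrays.asList("bb", "cc", "a");
            List<String> least2 = input.stream()
                .sorted(Comparator.comparingInt(String::length))
                .limit(2)
                .collect(Collectors.toList());
            System.out.println(least2); // prints [a, bb]
        }
    }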
With best regards,
Tagir Valeev.

LW> We had been planning to put together a Collector implementation
LW> for it, since it's actually pretty amenable to Collectorification
LW> (clearly a word).
LW>
LW> On Sat, Mar 5, 2016 at 12:32 PM Tagir F. Valeev <amae...@gmail.com> wrote:
LW>
LW> Hello!
LW>
LW> One of the popular bulk data operations is finding a given number
LW> of least or greatest elements. Currently the Stream API provides no
LW> dedicated operation to do this. Of course, it can be implemented
LW> with a custom collector, and some third-party libraries already
LW> provide one. However, it is quite natural to use the existing API:
LW>
LW> stream.sorted().limit(k) - k least elements
LW> stream.sorted(Comparator.reverseOrder()).limit(k) - k greatest elements
LW>
LW> In fact, people are already doing this. Some samples can be found
LW> on GitHub:
LW>
LW> https://github.com/search?l=java&q=%22sorted%28%29.limit%28%22&type=Code&utf8=%E2%9C%93
LW>
LW> Unfortunately, the current implementation of this sequence of
LW> operations is suboptimal: first the whole stream content is dumped
LW> into an intermediate array, then it is fully sorted, and only after
LW> that are the k least elements selected. It is possible instead to
LW> provide a special implementation for this particular case which
LW> takes O(k) additional memory and in many cases works significantly
LW> faster.
LW>
LW> I wrote a proof-of-concept implementation, which can be found here:
LW>
LW> http://cr.openjdk.java.net/~tvaleev/patches/sortedLimit/webrev/
LW>
LW> The implementation switches to the new algorithm when the limit is
LW> less than 1000, which is quite common for this scenario (supporting
LW> bigger values is also possible, but would require more testing).
LW> The new algorithm allocates an array of 2*limit elements. When it
LW> fills up, the array is sorted (using Arrays.sort) and the second
LW> half is discarded. After that, only elements which are less than
LW> the worst element found so far are accumulated. When the array
LW> fills up again, the second half is sorted and merged with the first
LW> half.
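(To make the quoted description concrete, the accumulation loop is
roughly the following. This is a simplified sequential sketch for ints,
not the actual webrev code; in particular, it re-sorts the whole buffer
instead of sorting only the new half and merging it with the first.)

    import java.util.Arrays;

    // Assumes k >= 1. Returns the k least elements in sorted order.
    static int[] leastK(int[] input, int k) {
        int[] buf = new int[2 * k];
        int size = 0;
        boolean primed = false; // true once buf[0..k) holds the k least so far
        for (int v : input) {
            // buf[k-1] is the worst of the k least seen so far; anything
            // not better than it cannot reach the result, so skip it.
            if (primed && v >= buf[k - 1]) {
                continue;
            }
            buf[size++] = v;
            if (size == buf.length) {
                // Buffer full: sort and keep only the better half.
                Arrays.sort(buf);
                size = k;
                primed = true;
            }
        }
        // Final step: sort whatever accumulated and take at most k.
        Arrays.sort(buf, 0, size);
        return Arrays.copyOf(buf, Math.min(size, k));
    }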
LW> Here is a JMH test, with results, which covers several input
LW> patterns:
LW>
LW> http://cr.openjdk.java.net/~tvaleev/patches/sortedLimit/jmh/
LW>
LW> You may check the summary first:
LW>
LW> http://cr.openjdk.java.net/~tvaleev/patches/sortedLimit/jmh/summary.txt
LW>
LW> Speedup values bigger than 1 are good.
LW>
LW> The most significant regression of the new implementation in
LW> sequential mode is on ever-decreasing input (especially with a low
LW> limit value). Still, it's not that bad (given that the old
LW> implementation processes such input very quickly). On the other
LW> hand, for random input the new implementation can be an order of
LW> magnitude faster. Even for ever-ascending input a noticeable
LW> speedup (around 40%) can be achieved.
LW>
LW> For parallel streams the new implementation is almost always
LW> faster, especially if you ignore the cases where a parallel stream
LW> is unprofitable anyway.
LW>
LW> What do you think about this improvement? Could it be included in
LW> JDK 9? Are there any issues I'm unaware of? I would be really happy
LW> to complete this work if it is supported by the JDK team. The
LW> current implementation has no primitive specializations and does
LW> not optimize the sorting away when the input is known to be sorted,
LW> but it's not very hard to add these features as well if you find
LW> the idea useful.
LW>
LW> With best regards,
LW> Tagir Valeev.
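P.S. For reference, the custom-collector route mentioned above could
look roughly like this (an untested sketch: the name leastK and the
"compact when the list doubles" strategy are only illustrative, and
this is neither Guava's planned collector nor the webrev code):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collector;

    static <T> Collector<T, ?, List<T>> leastK(int k,
                                               Comparator<? super T> cmp) {
        return Collector.of(
            ArrayList::new,
            (list, t) -> {
                list.add(t);
                if (list.size() >= 2 * k) {
                    list.sort(cmp);          // List.sort is stable
                    list.subList(k, list.size()).clear();
                }
            },
            (left, right) -> {               // merge partial results
                left.addAll(right);
                return left;
            },
            list -> {
                list.sort(cmp);
                if (list.size() > k) {
                    list.subList(k, list.size()).clear();
                }
                return list;
            });
    }

Usage would be, e.g., stream.collect(leastK(10, Comparator.naturalOrder())).
Like sorted().limit(k), this keeps O(k) extra memory and preserves
encounter order among equal elements, but as an ordinary collector it
cannot see stream flags, so it cannot skip the sort for already-sorted
input.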