[jira] [Commented] (CALCITE-3920) Improve ORDER BY computation in Enumerable convention by exploiting LIMIT

Thomas Rebele (Jira) Thu, 30 Jul 2020 11:43:09 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168140#comment-17168140
 ]


Thomas Rebele commented on CALCITE-3920:
----------------------------------------

[~zabetak] had mentioned the benchmarks that he had started on this issue. They 
showed that my heap implementation is in some cases much slower than a simple 
TreeMap based approach. I improved the heap a bit, but still, the results would 
not justify a custom heap implementation.

So I had the idea of a new algorithm which is optimized for the limit sort use 
case (called LimitSort for the time being):
 * Collect elements until the limit in an array, then sort.
 * Ignore new elements greater than the biggest element.
 * For smaller elements, search the insertion position with binary search, and 
put them into a separate list, remembering the insertion position. Remove the 
biggest element.
 * If we have no known biggest elements left in the array (because it is in the 
separate list), sort the list, then merge the list into the array.

The result of the benchmark can be found 
[here|https://github.com/thomasrebele/jmh-micro-benchmarks/blob/1178d5345e9ea37ccfc71ff0d447aca4b82c6eb2/core-java/doc/jmh_partial_sort_jdk1.8.0_241.txt#L5082].

Some findings:
 * If there's no limit, then the algorithms that fall back to Arrays.sort(...) 
win. These are collectionSort, priorityQueue and limitSort.
 * The LimitSort is among the algorithms that need the fewest comparisons for 
all benchmarks.
 * Its runtime is not far away from the best, and sometimes even the best.
 * TreeMap is not the best algorithm for sorting the whole array (limit = -1). 
Maybe the implementation of EnumerableDefaults.orderBy(...) should use an 
ArrayList and List.sort(...) instead of a TreeMap. The speedup would be up to 
2x.

> Improve ORDER BY computation in Enumerable convention by exploiting LIMIT
> -------------------------------------------------------------------------
>
>                 Key: CALCITE-3920
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3920
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.22.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Thomas Rebele
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> There are many use-cases (pagination, top-k) relying on queries with an ORDER 
> BY clause followed by a LIMIT. 
> At the moment, the two operations are implemented independently the one from 
> the other 
>  in the Enumerable convention. Even when we know that consumer needs only the 
> top-10 results the sort operation will try to maintain its entire input 
> sorted. The complexity of the sorting operation is O( n ) space and O( nlogn 
> ) time, where n is the size of the input. 
> By implementing ORDER BY and LIMIT together there are various optimizations 
> that can be applied to reduce the space and time complexity of the sorting 
> algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (CALCITE-3920) Improve ORDER BY computation in Enumerable convention by exploiting LIMIT

Reply via email to