[jira] [Commented] (CALCITE-3181) Support limit per group in Window

Haisheng Yuan (JIRA) Tue, 09 Jul 2019 03:26:15 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881121#comment-16881121
 ]


Haisheng Yuan commented on CALCITE-3181:
----------------------------------------

e.g. compute the total page views for the top 10 hot-sale products in each 
category:

{code:java}
select category, product, pv, 
     sum(pv) over(partition by category order by sale_count desc limit 10) from 
prod;
{code}

Instead of sort-based window implementation, we can use hash table based 
implementation for window with limit. Use the partition key as the hash key, 
each distinct key has an entry that contains a heap if there is order by 
clause, with the heap size be the limit value. In distributed system, we can 
have 2-phase window (global and local), execute local window before shuffling 
tuples by the window partition key, which might reduce the shuffle cost 
dramatically.

But the side effect is also obvious: this will break the assumption that window 
operator keeps the cardinality unchanged.


> Support limit per group in Window
> ---------------------------------
>
>                 Key: CALCITE-3181
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3181
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Haisheng Yuan
>            Priority: Major
>
> We have a lot of queries like the following to retrieve top N tuples per 
> group:
> {code:java}
> SELECT x, y FROM
>      (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) 
>      AS rn FROM t1) t2 WHERE rn <= 3;
> {code}
> The performance is not good if each group has a lot more tuples than wanted, 
> because we will retrieve and sort all the tuples, instead of just doing a 
> top-N heap sort.
> In order to do optimization for this kind of query, we need to extend window 
> to support limit, if and only if there is only 1 window function, and it is 
> {{row_number()}}. We also need a substitute rule to push the limit into 
> window. Of course, we also need to modify executor to support this 
> optimization (can be later).
> {code:java}
> Filter (rn <= 3)
>   +- Window (window#0={Partition by x order by y ROW_NUMBER()})
> {code}
> to
> {code:java}
> Filter (rn <= 3)
>   +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()})
> {code}
> Thoughts? Objections?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CALCITE-3181) Support limit per group in Window

Reply via email to