[ 
https://issues.apache.org/jira/browse/TAJO-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903986#comment-13903986
 ] 

Hyunsik Choi edited comment on TAJO-601 at 2/18/14 1:30 PM:
------------------------------------------------------------

Created a review request against the master branch on Review Board:
https://reviews.apache.org/r/18210/


was (Author: hyunsik):
Created a review request against branch master in reviewboard 


> Improve distinct aggregation query processing
> ---------------------------------------------
>
>                 Key: TAJO-601
>                 URL: https://issues.apache.org/jira/browse/TAJO-601
>             Project: Tajo
>          Issue Type: Improvement
>          Components: planner/optimizer
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.8-incubating
>
>         Attachments: TAJO-601.patch
>
>
> Currently, distinct aggregation queries are executed as follows:
> * the first stage: it just shuffles tuples by hashing grouping keys.
> * the second stage: it sorts them and executes sort aggregation.
> This approach executes queries that include distinct aggregation functions in
> only two stages, but it produces large intermediate data during the shuffle phase.
> This kind of query can be rewritten as a nested query with two aggregation steps:
> {code:title=original query}
> SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col 
> from rel1 group by grp1, grp2;
> {code}
> {code:title=rewritten query}
> SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from (
>   SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2, grp3
> ) tmp1 group by grp1, grp2;
> {code}
> I expect this rewrite to significantly reduce the intermediate data volume
> and query response time in most cases.
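
The equivalence of the two queries in the description can be checked on a small data set. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine; it does not run on Tajo, and the function name `compare_queries` and the sample rows are hypothetical. It also counts the rows produced by the inner aggregation, which roughly corresponds to the data the second aggregation step would receive after the rewrite.

```python
import sqlite3


def compare_queries(rows):
    """Run the original and the rewritten query on an in-memory SQLite table.

    SQLite is only a stand-in here (assumption); this is not Tajo's planner.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rel1 (grp1 TEXT, grp2 TEXT, grp3 TEXT)")
    conn.executemany("INSERT INTO rel1 VALUES (?, ?, ?)", rows)

    # Original form: distinct aggregation evaluated directly.
    original = conn.execute(
        "SELECT grp1, grp2, count(*) AS total, "
        "count(DISTINCT grp3) AS distinct_col "
        "FROM rel1 GROUP BY grp1, grp2 ORDER BY grp1, grp2"
    ).fetchall()

    # Rewritten form: pre-aggregate on (grp1, grp2, grp3), then aggregate again.
    rewritten = conn.execute(
        "SELECT grp1, grp2, sum(cnt) AS total, count(grp3) AS distinct_col FROM ("
        "  SELECT grp1, grp2, grp3, count(*) AS cnt "
        "  FROM rel1 GROUP BY grp1, grp2, grp3"
        ") tmp1 GROUP BY grp1, grp2 ORDER BY grp1, grp2"
    ).fetchall()

    # Number of rows leaving the inner aggregation.
    pre_aggregated = conn.execute(
        "SELECT count(*) FROM (SELECT 1 FROM rel1 GROUP BY grp1, grp2, grp3)"
    ).fetchone()[0]

    conn.close()
    return original, rewritten, pre_aggregated


# Hypothetical sample data, including a duplicate and a NULL in grp3.
sample_rows = [
    ("a", "x", "p"), ("a", "x", "p"), ("a", "x", "q"),
    ("a", "x", None), ("b", "y", "p"), ("b", "y", "p"),
]
original, rewritten, pre_aggregated = compare_queries(sample_rows)
print(original == rewritten, len(sample_rows), pre_aggregated)
```

On this sample both forms return the same rows, and the inner aggregation emits fewer rows than the input. How much the intermediate data shrinks in practice depends on how many duplicate (grp1, grp2, grp3) combinations the table contains.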



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
