[ 
https://issues.apache.org/jira/browse/TAJO-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129698#comment-14129698
 ] 

ASF GitHub Bot commented on TAJO-1010:
--------------------------------------

GitHub user blrunner opened a pull request:

    https://github.com/apache/tajo/pull/136

    TAJO-1010: Improve multiple DISTINCT aggregation. (hyoungjun, jaehwa)

    Tajo supports several options for count distinct. The existing option 
executes a count distinct query with two execution blocks, which are built by 
DistinctGroupbyBuilder::buildPlan. This patch adds a new option that executes 
the query with three execution blocks. To use it, set 
SessionVars.COUNT_DISTINCT_ALGORITHM to three_stages.
    
    * In the first stage, the Tajo operator expands each input row into 
several rows by grouping columns. In addition, the operator must emit a row 
for aggregating the non-distinct columns.
    * In the second stage, the Tajo operator aggregates the output of the 
first stage. For reference, the data is shuffled by the grouping columns and 
the aggregation columns.
    * In the third stage, the Tajo operator merges the output of the second 
stage. For reference, the data is shuffled by the grouping columns only.
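
The three stages above can be illustrated with a small in-memory sketch. This is not Tajo's implementation: the function names, the `amount` column, and the use of a `None` tag for non-distinct rows are assumptions made for illustration, and the shuffles are only simulated with a set and a dictionary.

```python
from collections import defaultdict


def stage1_expand(rows, group_col, distinct_cols, sum_col):
    """Stage 1: emit one tagged row per distinct column for every input row,
    plus one extra row (tag None) that carries the non-distinct aggregate."""
    for row in rows:
        key = row[group_col]
        for col in distinct_cols:
            yield (key, col, row[col])
        yield (key, None, row[sum_col])


def stage2_aggregate(expanded):
    """Stage 2: simulate a shuffle by (group key, column tag, value).
    Duplicate distinct values collapse; non-distinct values are summed."""
    distinct_seen = set()
    partial_sums = defaultdict(int)
    for key, col, value in expanded:
        if col is None:
            partial_sums[key] += value
        else:
            distinct_seen.add((key, col, value))
    return distinct_seen, partial_sums


def stage3_merge(distinct_seen, partial_sums, distinct_cols):
    """Stage 3: simulate a shuffle by group key only and produce the final
    per-group result: one distinct count per column plus the sum."""
    result = {}
    for key, total in partial_sums.items():
        result[key] = {"sum": total, **{col: 0 for col in distinct_cols}}
    for key, col, _value in distinct_seen:
        result[key][col] += 1
    return result


def three_stage_distinct(rows, group_col, distinct_cols, sum_col):
    """Run the three stages in sequence on an in-memory list of dict rows."""
    expanded = stage1_expand(rows, group_col, distinct_cols, sum_col)
    distinct_seen, partial_sums = stage2_aggregate(expanded)
    return stage3_merge(distinct_seen, partial_sums, distinct_cols)
```

Because the second stage keys on the column tag as well as the value, any number of distinct columns can be counted in one pass, which is the case the two-stage plan does not optimize.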


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrunner/tajo TAJO-1010

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/136.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #136
    
----
commit 615d84f13e8dd496c9c096cf2eeb6f7e3e16dfa2
Author: Jaehwa Jung <[email protected]>
Date:   2014-09-11T06:30:31Z

    TAJO-1010: Improve multiple DISTINCT aggregation. (hyoungjun, jaehwa)

----


> Improve multiple DISTINCT aggregation.
> --------------------------------------
>
>                 Key: TAJO-1010
>                 URL: https://issues.apache.org/jira/browse/TAJO-1010
>             Project: Tajo
>          Issue Type: Improvement
>          Components: planner/optimizer
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>
> Currently, Tajo provides a three-stage plan for optimizing distinct 
> aggregation queries, but it supports only one column for distinct 
> aggregation, as follows:
> {code:title=Query1|borderStyle=solid}
> select a.flag, count(distinct a.id) as cnt, sum(distinct a.id) as total
> from table1 a
> group by a.flag
> {code}
> If you use two or more columns for distinct aggregation, the optimized 
> distinct aggregation cannot be applied, as in the following query:
> {code:title=Query2|borderStyle=solid}
> select a.flag, count(distinct a.id) as cnt, sum(distinct a.id) as total
> , count(distinct a.name) as cnt2, count(distinct a.code) as cnt3
> from table1 a
> group by a.flag
> {code}
> In this case, the query may perform poorly. Thus, we need to improve 
> multiple DISTINCT aggregation. Specifically, we should support the 
> three-stage plan for multiple DISTINCT aggregation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
