[ 
https://issues.apache.org/jira/browse/IMPALA-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-2791:
-------------------------------------

    Assignee:     (was: Alexander Behm)

> Avoid unnecessary two-phased aggregation.
> -----------------------------------------
>
>                 Key: IMPALA-2791
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2791
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.2, Impala 2.3.0
>            Reporter: Alexander Behm
>            Priority: Minor
>              Labels: performance, planner
>
> We perform a two-phased aggregation for evaluating distinct aggregate 
> expressions like count(distinct). However, if the distinct aggregate 
> expression is not referenced in enclosing query blocks, then the two-phased 
> aggregation is unnecessary and should be skipped.
> Example:
> {code}
> explain select x from
>   (select count(int_col) x, count(distinct bigint_col) y
>    from functional.alltypes) v;
> +-----------------------------------------------------------+
> | Explain String                                            |
> +-----------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=170.00MB VCores=2 |
> |                                                           |
> | 06:AGGREGATE [FINALIZE]                                   |
> | |  output: count:merge(bigint_col), count:merge(int_col)  |
> | |                                                         |
> | 05:EXCHANGE [UNPARTITIONED]                               |
> | |                                                         |
> | 02:AGGREGATE                                              |
> | |  output: count(bigint_col), count:merge(int_col)        |
> | |                                                         |
> | 04:AGGREGATE                                              |
> | |  output: count:merge(int_col)                           |
> | |  group by: bigint_col                                   |
> | |                                                         |
> | 03:EXCHANGE [HASH(bigint_col)]                            |
> | |                                                         |
> | 01:AGGREGATE                                              |
> | |  output: count(int_col)                                 |
> | |  group by: bigint_col                                   |
> | |                                                         |
> | 00:SCAN HDFS [functional.alltypes]                        |
> |    partitions=24/24 files=24 size=478.45KB                |
> +-----------------------------------------------------------+
> {code}
> In the query above, the outer block references only x, so it is unnecessary 
> to compute the "count(distinct bigint_col)" aggregate expression (y), and a 
> single-phased aggregation would be sufficient.
> One way to fix this issue would be to defer creation of the AggregateInfo to 
> the planning phase, where the materialization of aggregate expressions is 
> known. Currently, we create the AggregateInfo during analysis. Retroactively 
> "fixing" an AggregateInfo during planning to remove the two phases seems 
> complicated.
> This limitation inhibits other optimizations such as IMPALA-2499.
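
The decision described above can be illustrated with a minimal sketch. It is not Impala frontend code: the names AggExpr, prune_unreferenced and needs_two_phase_agg are hypothetical, and it assumes the set of output aliases referenced by the enclosing query block is already known when the check runs (which is the point of deferring the decision to planning).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggExpr:
    """Hypothetical stand-in for one aggregate expression in a select list."""
    fn: str         # aggregate function name, e.g. "count"
    arg: str        # argument column, e.g. "bigint_col"
    distinct: bool  # True for count(distinct ...)
    alias: str      # output alias visible to the enclosing query block


def prune_unreferenced(agg_exprs, referenced_aliases):
    """Keep only the aggregates whose output the enclosing block references."""
    return [e for e in agg_exprs if e.alias in referenced_aliases]


def needs_two_phase_agg(agg_exprs, referenced_aliases):
    """Two-phased aggregation is only needed if a distinct aggregate survives pruning."""
    return any(e.distinct for e in prune_unreferenced(agg_exprs, referenced_aliases))


# The inline view from the example query: x = count(int_col), y = count(distinct bigint_col).
aggs = [
    AggExpr("count", "int_col", False, "x"),
    AggExpr("count", "bigint_col", True, "y"),
]

# The outer block selects only x, so the distinct aggregate can be dropped.
print(needs_two_phase_agg(aggs, {"x"}))
# If the outer block also selected y, the two phases would still be required.
print(needs_two_phase_agg(aggs, {"x", "y"}))
```

The sketch ignores details a real planner would have to handle, such as grouping expressions and aggregates referenced from HAVING or ORDER BY clauses.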


