[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block

Thomas Tauber-Marshall (JIRA) Tue, 23 Oct 2018 13:22:08 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661227#comment-16661227
 ]


Thomas Tauber-Marshall commented on IMPALA-110:
-----------------------------------------------

APPX_COUNT_DISTINCT is still present in 3.1, but we should mark it as 
deprecated. We probably won't get around to actually removing it for awhile, 
though.

Any user who wishes to get the perf gains from using ndv can always use ndv() 
directly in their query. The motivation behind appx_count_distinct was to help 
people transition from other systems that supported multiple distinct without 
having to rewrite queries. We can now support this directly.

For anyone wondering how it'll interact with multiple distinct, consider the 
query:
select count(distinct a), count(distinct b), sum(distinct c) from ...
Previously, we would not be able to support this query even with 
appx_count_distinct=true as the 'sum' will not be rewritten and there will 
still be 2 DISTINCT classes - one for the ndvs and one for the sum' Now, with 
IMPALA-110 we can execute this query, both with appx_count_distinct=true (with 
2 DISTINCT classes) and with appx_count_distinct=false (with 3 DISTINCT classes)

> Add support for multiple distinct operators in the same query block
> -------------------------------------------------------------------
>
>                 Key: IMPALA-110
>                 URL: https://issues.apache.org/jira/browse/IMPALA-110
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend, Frontend
>    Affects Versions: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala 
> 2.3.0
>            Reporter: Greg Rahn
>            Assignee: Thomas Tauber-Marshall
>            Priority: Major
>              Labels: sql-language
>             Fix For: Impala 3.1.0
>
>
> Impala only allows a single (DISTINCT columns) expression in each query.
> {color:red}Note:
> If you do not need precise accuracy, you can produce an estimate of the 
> distinct values for a column by specifying NDV(column); a query can contain 
> multiple instances of NDV(column). To make Impala automatically rewrite 
> COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query 
> option.
> {color}
> {code}
> [impala:21000] > select count(distinct i_class_id) from item;
> Query: select count(distinct i_class_id) from item
> Query finished, fetching results ...
> 16
> Returned 1 row(s) in 1.51s
> {code}
> {code}
> [impala:21000] > select count(distinct i_class_id), count(distinct 
> i_brand_id) from item;
> Query: select count(distinct i_class_id), count(distinct i_brand_id) from item
> ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in 
> select count(distinct i_class_id), count(distinct i_brand_id) from item)
>       at 
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133)
>       at 
> com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221)
>       at 
> com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89)
> Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT 
> aggregate functions need to have the same set of parameters as COUNT(DISTINCT 
> i_class_id); deviating function: COUNT(DISTINCT i_brand_id)
>       at 
> com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196)
>       at 
> com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143)
>       at 
> com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466)
>       at 
> com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347)
>       at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
>       at 
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130)
>       ... 2 more
> {code}
> Hive supports this:
> {code}
> $ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from 
> item;"
> Logging initialized using configuration in 
> file:/etc/hive/conf.dist/hive-log4j.properties
> Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> Starting Job = job_201302081514_0073, Tracking URL = 
> http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073
> Kill Command = /usr/lib/hadoop/bin/hadoop job  
> -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 
> 1
> 2013-03-05 22:34:43,255 Stage-1 map = 0%,  reduce = 0%
> 2013-03-05 22:34:49,323 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:50,337 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:51,351 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:52,360 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:53,370 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:54,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.81 
> sec
> 2013-03-05 22:34:55,389 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 8.58 sec
> 2013-03-05 22:34:56,402 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 8.58 sec
> 2013-03-05 22:34:57,413 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 8.58 sec
> 2013-03-05 22:34:58,424 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 8.58 sec
> MapReduce Total cumulative CPU time: 8 seconds 580 msec
> Ended Job = job_201302081514_0073
> MapReduce Jobs Launched: 
> Job 0: Map: 1  Reduce: 1   Cumulative CPU: 8.58 sec   HDFS Read: 0 HDFS 
> Write: 0 SUCCESS
> Total MapReduce CPU Time Spent: 8 seconds 580 msec
> OK
> 16    952
> Time taken: 25.666 seconds
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-110) Add support for multiple distinct operators in the same query block

Reply via email to