[
https://issues.apache.org/jira/browse/IMPALA-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664030#comment-16664030
]
Thomas Tauber-Marshall commented on IMPALA-110:
-----------------------------------------------
Alright so it sounds like we should just leave appx_count_distinct as is.
[~arodoni_cloudera] we'll still need to update the docs page to remove info
about using it to get around our lack of support for multiple count distinct
and probably add some info about the perf implications.
> Add support for multiple distinct operators in the same query block
> -------------------------------------------------------------------
>
> Key: IMPALA-110
> URL: https://issues.apache.org/jira/browse/IMPALA-110
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend, Frontend
> Affects Versions: Impala 0.5, Impala 1.4, Impala 2.0, Impala 2.2, Impala
> 2.3.0
> Reporter: Greg Rahn
> Assignee: Thomas Tauber-Marshall
> Priority: Major
> Labels: sql-language
> Fix For: Impala 3.1.0
>
>
> Impala only allows a single (DISTINCT columns) expression in each query.
> {color:red}Note:
> If you do not need precise accuracy, you can produce an estimate of the
> distinct values for a column by specifying NDV(column); a query can contain
> multiple instances of NDV(column). To make Impala automatically rewrite
> COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query
> option.
> {color}
> {code}
> [impala:21000] > select count(distinct i_class_id) from item;
> Query: select count(distinct i_class_id) from item
> Query finished, fetching results ...
> 16
> Returned 1 row(s) in 1.51s
> {code}
> {code}
> [impala:21000] > select count(distinct i_class_id), count(distinct
> i_brand_id) from item;
> Query: select count(distinct i_class_id), count(distinct i_brand_id) from item
> ERROR: com.cloudera.impala.common.AnalysisException: Analysis exception (in
> select count(distinct i_class_id), count(distinct i_brand_id) from item)
> at
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:133)
> at
> com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:221)
> at
> com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:89)
> Caused by: com.cloudera.impala.common.AnalysisException: all DISTINCT
> aggregate functions need to have the same set of parameters as COUNT(DISTINCT
> i_class_id); deviating function: COUNT(DISTINCT i_brand_id)
> at
> com.cloudera.impala.analysis.AggregateInfo.createDistinctAggInfo(AggregateInfo.java:196)
> at
> com.cloudera.impala.analysis.AggregateInfo.create(AggregateInfo.java:143)
> at
> com.cloudera.impala.analysis.SelectStmt.createAggInfo(SelectStmt.java:466)
> at
> com.cloudera.impala.analysis.SelectStmt.analyzeAggregation(SelectStmt.java:347)
> at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
> at
> com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:130)
> ... 2 more
> {code}
> Hive supports this:
> {code}
> $ hive -e "select count(distinct i_class_id), count(distinct i_brand_id) from
> item;"
> Logging initialized using configuration in
> file:/etc/hive/conf.dist/hive-log4j.properties
> Hive history file=/tmp/grahn/hive_job_log_grahn_201303052234_1625576708.txt
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
> set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
> set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
> set mapred.reduce.tasks=<number>
> Starting Job = job_201302081514_0073, Tracking URL =
> http://impala:50030/jobdetails.jsp?jobid=job_201302081514_0073
> Kill Command = /usr/lib/hadoop/bin/hadoop job
> -Dmapred.job.tracker=m0525.mtv.cloudera.com:8021 -kill job_201302081514_0073
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers:
> 1
> 2013-03-05 22:34:43,255 Stage-1 map = 0%, reduce = 0%
> 2013-03-05 22:34:49,323 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81
> sec
> 2013-03-05 22:34:50,337 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81
> sec
> 2013-03-05 22:34:51,351 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81
> sec
> 2013-03-05 22:34:52,360 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81
> sec
> 2013-03-05 22:34:53,370 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81
> sec
> 2013-03-05 22:34:54,379 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81
> sec
> 2013-03-05 22:34:55,389 Stage-1 map = 100%, reduce = 100%, Cumulative CPU
> 8.58 sec
> 2013-03-05 22:34:56,402 Stage-1 map = 100%, reduce = 100%, Cumulative CPU
> 8.58 sec
> 2013-03-05 22:34:57,413 Stage-1 map = 100%, reduce = 100%, Cumulative CPU
> 8.58 sec
> 2013-03-05 22:34:58,424 Stage-1 map = 100%, reduce = 100%, Cumulative CPU
> 8.58 sec
> MapReduce Total cumulative CPU time: 8 seconds 580 msec
> Ended Job = job_201302081514_0073
> MapReduce Jobs Launched:
> Job 0: Map: 1 Reduce: 1 Cumulative CPU: 8.58 sec HDFS Read: 0 HDFS
> Write: 0 SUCCESS
> Total MapReduce CPU Time Spent: 8 seconds 580 msec
> OK
> 16 952
> Time taken: 25.666 seconds
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]