[ https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998799#comment-14998799 ]
Piotr Niemcunowicz commented on SPARK-4243:
-------------------------------------------
The same happens when using HiveContext.
> Spark SQL SELECT COUNT DISTINCT optimization
> --------------------------------------------
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Bojan Kostić
>
> Spark SQL runs slowly when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
> parquetFile.registerTempTable("parquetFile")
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phase:
> 3.2 min
> 17 s
> New query stats by phase:
> 0.3 s
> 16 s
> 20 s
> You may also want to look at this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
> FROM parquetFile
> {code}
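A note on the rewrite quoted above: in standard SQL the two forms agree only when f2 contains no NULLs. COUNT(DISTINCT f2) skips NULLs, while COUNT(*) over SELECT DISTINCT f2 counts NULL as one extra row. The sketch below models this with plain Scala collections rather than Spark. The object name, the helper names, and the sample data are hypothetical, and None stands in for SQL NULL.

```scala
object CountDistinctSketch {
  // Hypothetical in-memory stand-in for column f2; None models SQL NULL.
  val f2: Seq[Option[String]] = Seq(Some("a"), Some("b"), Some("a"), None, None)

  // Models SELECT COUNT(DISTINCT f2): NULLs are dropped before deduplication.
  def countDistinct(col: Seq[Option[String]]): Int =
    col.flatten.distinct.size

  // Models SELECT COUNT(*) FROM (SELECT DISTINCT f2 ...): NULL survives as one distinct row.
  def countOfDistinctRows(col: Seq[Option[String]]): Int =
    col.distinct.size

  def main(args: Array[String]): Unit = {
    println(countDistinct(f2))       // distinct non-NULL values
    println(countOfDistinctRows(f2)) // distinct rows, including the NULL row
  }
}
```

If f2 may contain NULLs, adding WHERE f2 IS NOT NULL to the inner SELECT DISTINCT should keep the rewritten query equivalent to COUNT(DISTINCT f2).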