[ 
https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998813#comment-14998813
 ] 

Yin Huai commented on SPARK-4243:
---------------------------------

The optimization of {{SELECT COUNT(DISTINCT f2) FROM parquetFile}} will be done 
as part of https://github.com/apache/spark/pull/9556. The query will be 
rewritten to the equivalent form {{SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM 
parquetFile) a}}.
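
To make the rewrite concrete, here is a minimal sketch that runs both forms of the query against the same table and compares the results. It is only an illustration, not the implementation in the pull request. It assumes the Spark 1.x API used in the issue ({{SQLContext}}, {{registerTempTable}}), and it replaces the reporter's Parquet data with a small in-memory column of hypothetical values. The object name, app name, and sample values are invented for the example. The sample values are deliberately non-null: {{COUNT(DISTINCT f2)}} ignores NULLs, while the hand-written subquery form counts a NULL group as one row, so the two hand-written forms agree only when {{f2}} contains no NULLs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object; the name is made up for this sketch.
object CountDistinctRewriteSketch {
  def main(args: Array[String]): Unit = {
    // Local master so the sketch runs without a cluster.
    val conf = new SparkConf()
      .setAppName("count-distinct-rewrite-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for the Parquet data: a single non-null string column f2.
    // Tuple1 wraps each value so that toDF can derive a one-column schema.
    val df = sc.parallelize(Seq("a", "b", "a", "c", "b"))
      .map(Tuple1(_))
      .toDF("f2")
    df.registerTempTable("parquetFile")

    // Original form from the issue.
    val direct = sqlContext
      .sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
      .collect()(0).getLong(0)

    // Rewritten form: deduplicate first, then count the remaining rows.
    val rewritten = sqlContext
      .sql("SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a")
      .collect()(0).getLong(0)

    println(s"direct=$direct rewritten=$rewritten")
    assert(direct == rewritten, "both forms should agree on non-null data")

    sc.stop()
  }
}
```

Comparing the physical plans of the two statements on a given Spark version (for example with {{EXPLAIN}}) is one way to check whether the optimizer already produces the same plan for both.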

The improvement for {{SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT 
f3), COUNT(DISTINCT f4) FROM parquetFile}} is part of SPARK-9241.

> Spark SQL SELECT COUNT DISTINCT optimization
> --------------------------------------------
>
>                 Key: SPARK-4243
>                 URL: https://issues.apache.org/jira/browse/SPARK-4243
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Bojan Kostić
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
> parquetFile.registerTempTable("parquetFile") 
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phase:
> 3.2 min
> 17 s
> New query stats by phase:
> 0.3 s
> 16 s
> 20 s
> This query may also be worth optimizing:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
> FROM parquetFile 
> {code}


