Bojan Kostić created SPARK-4243:
-----------------------------------

             Summary: Spark SQL SELECT COUNT DISTINCT optimization
                 Key: SPARK-4243
                 URL: https://issues.apache.org/jira/browse/SPARK-4243
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Bojan Kostić


Spark SQL runs slowly when executing this code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
parquetFile.registerTempTable("parquetFile") 
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
count.map(t => t(0)).collect().foreach(println)

But with this query it runs much faster:
SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
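
A caveat on this rewrite: in standard SQL the two forms agree only when f2 contains no NULLs. COUNT(DISTINCT f2) skips NULLs, while SELECT DISTINCT f2 keeps NULL as one row, which COUNT(*) then counts. A minimal sketch of the difference using plain Scala collections (no Spark involved; the object name, helper names, and sample values are hypothetical, and Option stands in for a nullable column):

```scala
// Illustration only: plain Scala collections stand in for the table.
// None models a SQL NULL in column f2; the sample data is made up.
object CountDistinctSketch {
  // SELECT COUNT(DISTINCT f2): NULLs are dropped before de-duplicating.
  def countDistinct(f2: Seq[Option[String]]): Int =
    f2.flatten.distinct.size

  // SELECT COUNT(*) FROM (SELECT DISTINCT f2 ...) a:
  // DISTINCT keeps NULL as one row, and COUNT(*) counts every row.
  def countOverDistinct(f2: Seq[Option[String]]): Int =
    f2.distinct.size

  def main(args: Array[String]): Unit = {
    val noNulls: Seq[Option[String]] = Seq(Some("a"), Some("b"), Some("a"))
    val withNull: Seq[Option[String]] = Seq(Some("a"), None, Some("b"), None)

    // Without NULLs both forms give 2.
    println(countDistinct(noNulls) == countOverDistinct(noNulls))   // true
    // With NULLs the first form gives 2 and the second gives 3.
    println(countDistinct(withNull) == countOverDistinct(withNull)) // false
  }
}
```

If f2 can be NULL, adding WHERE f2 IS NOT NULL to the inner SELECT should make the rewritten query match COUNT(DISTINCT f2).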

Old query stats by phase:
3.2 min
17 s
New query stats by phase:
0.3 s
16 s
20 s

This query may also be worth looking at for optimization:
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) 
FROM parquetFile 
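
For reference, what this query asks for can be sketched with plain Scala collections: a single pass over the rows keeps a non-NULL counter for f1 and one set of seen values per DISTINCT column. This only illustrates the semantics and is not a description of how Spark SQL plans the query; the Rec and Acc case classes, the column types, and the sample rows are all hypothetical.

```scala
object MultiDistinctSketch {
  // Hypothetical row shape; None models a SQL NULL.
  final case class Rec(f1: Option[Int], f2: Option[String],
                       f3: Option[String], f4: Option[String])

  // Running state: COUNT(f1) plus the distinct values seen so far for f2..f4.
  final case class Acc(f1Count: Long = 0L,
                       f2: Set[String] = Set.empty,
                       f3: Set[String] = Set.empty,
                       f4: Set[String] = Set.empty)

  // Single pass over the rows; a NULL (None) contributes to no aggregate.
  def aggregate(rows: Seq[Rec]): (Long, Int, Int, Int) = {
    val acc = rows.foldLeft(Acc()) { (a, r) =>
      Acc(
        a.f1Count + (if (r.f1.isDefined) 1L else 0L), // COUNT(f1)
        a.f2 ++ r.f2,                                 // COUNT(DISTINCT f2)
        a.f3 ++ r.f3,                                 // COUNT(DISTINCT f3)
        a.f4 ++ r.f4                                  // COUNT(DISTINCT f4)
      )
    }
    (acc.f1Count, acc.f2.size, acc.f3.size, acc.f4.size)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Rec(Some(1), Some("x"), Some("p"), None),
      Rec(Some(2), Some("x"), Some("q"), Some("m")),
      Rec(None,    Some("y"), Some("p"), Some("m"))
    )
    println(aggregate(rows)) // prints (2,2,2,1)
  }
}
```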



