Bojan Kostić created SPARK-4243:
-----------------------------------

             Summary: Spark SQL SELECT COUNT DISTINCT optimization
                 Key: SPARK-4243
                 URL: https://issues.apache.org/jira/browse/SPARK-4243
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Bojan Kostić

Spark SQL runs slowly when using this code:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
    parquetFile.registerTempTable("parquetFile")
    val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
    count.map(t => t(0)).collect().foreach(println)

But with this query it runs much faster:

    SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a

Old query stats by phase: 3.2 min, 17 s
New query stats by phase: 0.3 s, 16 s, 20 s

This query may also be worth looking at for optimization:

    SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile
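
One caveat on the faster query: under standard SQL semantics COUNT(DISTINCT f2) skips NULLs, while COUNT(*) over SELECT DISTINCT f2 counts NULL as one extra row, so the two forms agree only when f2 has no NULLs (or when NULLs are filtered inside the subquery, e.g. with WHERE f2 IS NOT NULL). The sketch below models this with plain Scala collections rather than Spark. The object and function names are hypothetical and chosen only for illustration, and None stands in for SQL NULL.

```scala
// Plain-Scala model of the two query shapes; no Spark involved.
// None stands in for SQL NULL. All names here are hypothetical.
object CountDistinctSketch {

  // Models COUNT(DISTINCT f2): NULLs are dropped before counting.
  def countDistinct[A](column: Seq[Option[A]]): Long =
    column.collect { case Some(v) => v }.distinct.size.toLong

  // Models SELECT COUNT(*) FROM (SELECT DISTINCT f2 ...):
  // NULL survives DISTINCT as its own row and is counted.
  def countOverDistinct[A](column: Seq[Option[A]]): Long =
    column.distinct.size.toLong

  // Models the same rewrite with NULLs filtered in the subquery.
  def countOverDistinctNonNull[A](column: Seq[Option[A]]): Long =
    column.filter(_.isDefined).distinct.size.toLong

  def main(args: Array[String]): Unit = {
    val f2 = Seq(Some("a"), Some("b"), Some("a"), None, None)
    println(countDistinct(f2))            // 2
    println(countOverDistinct(f2))        // 3
    println(countOverDistinctNonNull(f2)) // 2
  }
}
```

If an optimizer rule performed this rewrite automatically, it would presumably need the NULL filter (or a NOT NULL guarantee on the column) to preserve results.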