[ https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998813#comment-14998813 ]
Yin Huai commented on SPARK-4243:
---------------------------------

The optimization of {{SELECT COUNT(DISTINCT f2) FROM parquetFile}} will be done as part of https://github.com/apache/spark/pull/9556. We will rewrite the query to the equivalent form {{SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a}}.

The improvement of {{SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile}} is part of SPARK-9241.

> Spark SQL SELECT COUNT DISTINCT optimization
> --------------------------------------------
>
>                 Key: SPARK-4243
>                 URL: https://issues.apache.org/jira/browse/SPARK-4243
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Bojan Kostić
>
> Spark SQL runs slowly when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
> parquetFile.registerTempTable("parquetFile")
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old query stats by phase:
> 3.2 min
> 17 s
> New query stats by phase:
> 0.3 s
> 16 s
> 20 s
> Maybe you should also look at this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
> FROM parquetFile
> {code}
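
A note on the rewrite mentioned in the comment above: the two query forms can be compared on a small in-memory example. The following is a minimal sketch using plain Scala collections rather than Spark. The object name, the helper names, and the sample data are hypothetical, and it is not a description of how the pull request implements the rewrite. It assumes standard SQL semantics, where {{COUNT(DISTINCT f2)}} skips NULLs while {{SELECT DISTINCT f2}} keeps a single NULL row that the outer {{COUNT(*)}} then counts.

```scala
// Plain Scala collections stand in for the table; there is no Spark dependency.
// All names here are hypothetical and exist only for illustration.
object CountDistinctRewrite {
  // Stand-in for column f2; None models a SQL NULL.
  type F2 = Option[String]

  // Models SELECT COUNT(DISTINCT f2) FROM t: NULLs are dropped before counting.
  def countDistinct(f2: Seq[F2]): Long =
    f2.flatten.distinct.size.toLong

  // Models SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM t) a:
  // the inner DISTINCT keeps one NULL row, and COUNT(*) counts every row.
  def countStarOverDistinct(f2: Seq[F2]): Long =
    f2.distinct.size.toLong

  def main(args: Array[String]): Unit = {
    val noNulls: Seq[F2] = Seq(Some("a"), Some("b"), Some("a"), Some("c"))
    val withNull: Seq[F2] = noNulls :+ None

    println(countDistinct(noNulls))          // 3
    println(countStarOverDistinct(noNulls))  // 3
    println(countDistinct(withNull))         // 3
    println(countStarOverDistinct(withNull)) // 4
  }
}
```

Under these assumptions the two forms agree whenever f2 contains no NULLs and differ by one when it does. Counting {{f2}} instead of {{*}} in the outer query is one way the rewritten form could stay aligned with {{COUNT(DISTINCT f2)}} on nullable columns.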