While testing Spark SQL, I noticed that COUNT DISTINCT runs really slowly.
The map partitions phase finishes quickly, but the collect phase is slow.
It runs on only a single executor.
Is it supposed to run this way?

Here is the simple code I use for testing:
// Load the Parquet data and register it as a temporary table
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
// Run the distinct count and print the single result row
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
count.map(t => t(0)).collect().foreach(println)

I guess this is because the distinct step has to run on a single node, but I
wonder whether I can add some parallelism to the collect phase.
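
One possible workaround, as an untested sketch only: do the deduplication as a
distributed step first and count afterwards. It assumes the sqlContext and the
"parquetFile" table registered in the code above; the names values,
distinctCount and rewritten are only illustrative. Whether this is actually
faster depends on the Spark version and on the data.

```scala
// Assumes the sqlContext and the "parquetFile" temp table from the code above.
// Select the raw column and drop NULLs, since COUNT(DISTINCT ...) ignores them.
val values = sqlContext.sql("SELECT f2 FROM parquetFile")
  .map(t => t(0))
  .filter(_ != null)

// RDD.distinct() shuffles by value, so deduplication runs across all partitions;
// count() then only sums the per-partition counts on the driver.
val distinctCount = values.distinct().count()
println(distinctCount)

// Alternative in plain SQL: GROUP BY does the deduplication, COUNT(*) counts the groups.
// Assumption: the SQL parser of the Spark version in use accepts a subquery in FROM.
val rewritten = sqlContext.sql(
  "SELECT COUNT(*) FROM (SELECT f2 FROM parquetFile WHERE f2 IS NOT NULL GROUP BY f2) t")
rewritten.map(t => t(0)).collect().foreach(println)
```

Both variants should print the same number as the original COUNT(DISTINCT f2)
query, so the three can be compared directly on the same data.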