Hi all,

We would like to run a count-distinct query based on a certain filter. For example, our data is of the form:

userId, Name, Restaurant name, Restaurant Type
===============================================
100, John, Pizza Hut, Pizza
100, John, Del Pepe, Pasta
100, John, Hagen Daz, Ice Cream
100, John, Dominos, Pasta
200, Mandy, Del Pepe, Pasta

and we would like to know the number of distinct Pizza eaters.

The issue is that we have roughly 200 million entries, so even with a large cluster we could still risk running out of memory if the distinct implementation has to load all of the data into RAM. The Spark Core implementation, which reduces each key to a 1 and then sums, doesn't have this risk; a sketch of what we mean follows.
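For concreteness, here is a minimal sketch of the Spark Core version we have in mind (spark-shell style, so `spark` is the predefined session; the file path and variable names are ours for illustration, the column names are from the example above):

    // Load the ~200M-row dataset from the example (path is hypothetical).
    val visits = spark.read.option("header", "true").csv("/path/to/visits.csv")

    // Spark Core style: filter to Pizza rows, reduce each userId to a
    // single record, then map every record to 1 and sum. The reduceByKey
    // shuffle can spill to disk, so no single executor has to hold all
    // distinct users in RAM at once.
    val pizzaEaters = visits.rdd
      .filter(row => row.getAs[String]("Restaurant Type") == "Pizza")
      .map(row => (row.getAs[String]("userId"), 1))
      .reduceByKey((a, _) => a)      // collapse duplicate users
      .map(_ => 1L)
      .reduce(_ + _)                 // "reduce to 1 and sum"

    println(s"distinct Pizza eaters: $pizzaEaters")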
I've found this old thread, which compares Spark Core and Spark SQL count-distinct performance:
http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html

From reading the source code, it seems the current Spark SQL count-distinct implementation is no longer based on a hash set, but we're still concerned that it won't be as safe as the Spark Core implementation. We don't mind waiting a long time for the computation to finish, but we don't want to run into out-of-memory errors.

Would highly appreciate any input.

Thanks,
Avshalom
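P.S. For reference, the Spark SQL version we are asking about would be roughly the following (same hypothetical `visits` DataFrame as in the sketch above; whether this aggregate is as memory-safe as the Core version is exactly our question):

    import org.apache.spark.sql.functions.countDistinct

    // Spark SQL style: the countDistinct aggregate whose memory
    // behaviour we are asking about.
    val pizzaEatersSql = visits
      .filter(visits("Restaurant Type") === "Pizza")
      .agg(countDistinct(visits("userId")))
      .first()
      .getLong(0)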