Hi all,

We would like to run a count-distinct query with a filter applied.
For example, our data is of the form:

userId, Name,  Restaurant Name, Restaurant Type
===============================================
100,    John,  Pizza Hut,       Pizza
100,    John,  Del Pepe,        Pasta
100,    John,  Hagen Daz,       Ice Cream
100,    John,  Dominos,         Pasta
200,    Mandy, Del Pepe,        Pasta

And we would like to know the number of distinct Pizza eaters.
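
In Spark SQL terms, the query we want is simply the following (assuming the
data is registered as a table named "visits"; the table and column names
here are just for illustration):

    SELECT COUNT(DISTINCT userId)
    FROM visits
    WHERE restaurantType = 'Pizza'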

The issue is that we have roughly 200 million entries, so even with a large
cluster we could still risk running out of memory if the distinct
implementation has to load all of the data into RAM.
The Spark Core approach (map each record to a (userId, 1) pair, reduce each
key's values to 1, then count, as sketched below) doesn't have this risk.
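
Concretely, we have something like this in mind (a rough sketch; the Visit
case class and variable names are just illustrative):

    // Spark Core sketch: reduceByKey can spill to disk during the shuffle,
    // so no single executor has to hold every distinct userId in RAM.
    case class Visit(userId: Long, name: String,
                     restaurantName: String, restaurantType: String)

    val pizzaEaters = visitsRdd                // visitsRdd: RDD[Visit]
      .filter(_.restaurantType == "Pizza")
      .map(v => (v.userId, 1))
      .reduceByKey((a, b) => 1)                // collapse duplicates per userId
      .count()                                 // = number of distinct pizza eaters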

I found this old thread, which compares Spark Core and Spark SQL
count-distinct performance:

http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html

From reading the source code, it seems the current Spark SQL count-distinct
implementation is no longer based on a hash set, but we're still concerned
that it won't be as safe as the Spark Core implementation.
We don't mind the computation taking a long time to finish, but we don't
want to hit out-of-memory errors.
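
For what it's worth, if the exact path proves too risky we could probably
fall back to the HyperLogLog-based approximation, which keeps a small,
bounded sketch per partition instead of a full set of IDs (exposed as
approxCountDistinct in the DataFrame API, renamed approx_count_distinct in
later versions), e.g.:

    import org.apache.spark.sql.functions.{approxCountDistinct, col}

    // Approximate distinct count: bounded memory, tunable relative error.
    val approxPizzaEaters = visitsDf           // visitsDf: DataFrame
      .filter(col("restaurantType") === "Pizza")
      .agg(approxCountDistinct(col("userId")))
      .first().getLong(0)

But we would prefer an exact count if it can be computed safely.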

Would highly appreciate any input.

Thanks
Avshalom


