You should be fine from 1.6 onward: count distinct no longer requires the data to fit in memory there. As far as I know, distinct aggregates are now planned as regular aggregations, which can spill to disk like any other.
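For the filter-plus-count-distinct itself, a minimal sketch against the DataFrame API (the visits name and the column names are just guesses from your example data):

    import org.apache.spark.sql.DataFrame

    // `visits` is assumed to hold your example data:
    // userId, name, restaurantName, restaurantType
    def countDistinctPizzaEaters(visits: DataFrame): Long =
      visits
        .filter(visits("restaurantType") === "Pizza") // keep only Pizza rows
        .select("userId")                             // project the key column
        .distinct()                                   // shuffle-based de-duplication
        .count()

If an approximate answer is acceptable, approxCountDistinct("userId") from org.apache.spark.sql.functions is safer still: it's HyperLogLog-based, so per-partition memory stays bounded regardless of cardinality.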
On Thu, Jun 16, 2016 at 1:57 AM, Avshalom <avshalom...@gmail.com> wrote:
> Hi all,
>
> We would like to perform a count distinct query based on a certain
> filter. For example, our data is of the form:
>
> userId, Name, Restaurant Name, Restaurant Type
> ==============================================
> 100, John, Pizza Hut, Pizza
> 100, John, Del Pepe, Pasta
> 100, John, Hagen Daz, Ice Cream
> 100, John, Dominos, Pasta
> 200, Mandy, Del Pepe, Pasta
>
> And we would like to know the number of distinct Pizza eaters.
>
> The issue is that we have roughly ~200 million entries, so even with a
> large cluster we could still be at risk of memory overload if the
> distinct implementation has to load all of the data into RAM. The Spark
> Core implementation, which reduces each key to 1 and then sums, doesn't
> have this risk.
>
> I've found this old thread which compares Spark Core and Spark SQL
> count distinct performance:
> http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html
>
> From reading the source code, it seems the current Spark SQL count
> distinct implementation is no longer based on a hash set, but we're
> still concerned that it won't be as safe as the Spark Core
> implementation. We don't mind waiting a long time for the computation
> to finish, but we don't want to hit out-of-memory errors.
>
> Would highly appreciate any input.
>
> Thanks,
> Avshalom
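For reference, the RDD version you describe ("reduce to 1 and sum") would look roughly like this (the Visit type and field names are made up); reduceByKey can spill to disk, so the distinct set never has to sit in memory on one machine:

    import org.apache.spark.rdd.RDD

    case class Visit(userId: Long, name: String,
                     restaurantName: String, restaurantType: String)

    // Map each matching user to (userId, 1), collapse duplicates with
    // reduceByKey (which can spill), then count the remaining keys.
    def countDistinctPizzaEatersRdd(visits: RDD[Visit]): Long =
      visits
        .filter(_.restaurantType == "Pizza")
        .map(v => (v.userId, 1))
        .reduceByKey(_ + _)  // one entry per distinct userId
        .count()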