You should be fine from 1.6 onward. Count distinct there doesn't require the
data to fit in memory.
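
For example, a minimal DataFrame sketch (assuming the data is loaded as a
DataFrame "df" with the column names from your sample):

    import org.apache.spark.sql.functions.countDistinct

    // Number of distinct users who ate at a Pizza-type restaurant.
    val pizzaEaters = df
      .filter(df("Restaurant Type") === "Pizza")
      .agg(countDistinct(df("userId")))

    pizzaEaters.show()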


On Thu, Jun 16, 2016 at 1:57 AM, Avshalom <avshalom...@gmail.com> wrote:

> Hi all,
>
> We would like to perform a count distinct query based on a certain filter.
> e.g. our data is of the form:
>
> userId, Name,  Restaurant Name, Restaurant Type
> ===============================================
> 100,    John,  Pizza Hut,       Pizza
> 100,    John,  Del Pepe,        Pasta
> 100,    John,  Hagen Daz,       Ice Cream
> 100,    John,  Dominos,         Pasta
> 200,    Mandy, Del Pepe,        Pasta
>
> And we would like to know the number of distinct Pizza eaters.
>
> The issue is that we have roughly 200 million entries, so even with a large
> cluster we could still be at risk of memory overload if the distinct
> implementation has to load all of the data into RAM.
> The Spark Core implementation, which reduces each key to a single 1 and
> then sums, doesn't have this risk.
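>
> For concreteness, a sketch of the kind of job we mean (assuming a pair
> RDD of (userId, restaurantType) parsed from the data):
>
>     // One (userId, 1) pair per pizza visit; reduceByKey collapses the
>     // duplicates to a single 1 per user, and summing the 1s gives the
>     // number of distinct pizza eaters without a global hash set.
>     val distinctPizzaEaters = pairs
>       .filter { case (_, rType) => rType == "Pizza" }
>       .map { case (userId, _) => (userId, 1) }
>       .reduceByKey((a, _) => a)
>       .values
>       .sum()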
>
> I've found this old thread which compares Spark Core and Spark SQL count
> distinct performance:
>
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html
>
> From reading the source code, it seems the current Spark SQL count
> distinct implementation is no longer based on a hash set, but we're still
> concerned that it won't be as safe as the Spark Core implementation.
> We don't mind waiting a long time for the computation to finish, but we
> don't want to run into out-of-memory errors.
>
> Would highly appreciate any input.
>
> Thanks
> Avshalom
>
