count() behaves differently on DataFrames and Datasets than on RDDs. On an 
RDD it always evaluates every element, but on a DataFrame/Dataset it turns 
into the equivalent of "select count(*) from ..." in SQL, which some data 
formats (e.g. Parquet) can answer without scanning the data. On the other 
hand, caching a DataFrame/Dataset does still require everything to be 
materialized.
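
For illustration, here is a minimal Scala sketch of the difference (the 
Parquet path is made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("count-demo").getOrCreate()

    // DataFrame/Dataset count() compiles to "select count(*)"; for Parquet
    // this can often be answered from file metadata without a full scan.
    val df = spark.read.parquet("/data/events.parquet")
    df.count()

    // RDD count() always evaluates every element.
    df.rdd.count()

    // Caching is different: cache() is lazy, and the count() that follows
    // materializes the entire dataset into the cache.
    df.cache()
    df.count()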

Matei

> On Feb 18, 2017, at 2:16 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> I think the right answer is "don't do that," but if you really had to, you 
> could trigger a Dataset operation that does nothing per partition. I presume 
> that would be more reliable, because in practice the whole partition has to 
> be computed to make it available. Or, go so far as to loop over every element.
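> 
> For example, something like this (just a sketch):
> 
>     import org.apache.spark.sql.{Dataset, SparkSession}
> 
>     val spark = SparkSession.builder().getOrCreate()
>     import spark.implicits._
> 
>     val ds: Dataset[String] = Seq("a", "b", "c").toDS()
> 
>     // One job, one task per partition; the function itself does nothing.
>     ds.foreachPartition((it: Iterator[String]) => ())
> 
>     // Or drain each partition's iterator to be sure every element is
>     // actually computed.
>     ds.foreachPartition((it: Iterator[String]) => it.foreach(_ => ()))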
> 
> On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Especially during development, people often use .count() or 
> .persist().count() to force evaluation of all rows — exposing any problems, 
> e.g. due to bad data — and to load data into cache to speed up subsequent 
> operations.
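> 
> For example (a sketch; the input path and SparkSession setup are assumed):
> 
>     val df = spark.read.json("/data/raw.json")  // assumes `spark` is a SparkSession
>     df.persist()
>     df.count()  // evaluates all rows and populates the cache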
> 
> But as the optimizer gets smarter, I’m guessing it will eventually learn that 
> it doesn’t have to do all that work to give the correct count. (This blog 
> post 
> <https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html>
>  suggests that something like this is already happening.) This will change 
> Spark’s practical behavior while technically preserving semantics.
> 
> What will people need to do then to force evaluation or caching?
> 
> Nick
> 
