Count behaves differently on DataFrames and Datasets than on RDDs. On an RDD, it always evaluates everything, but on a DataFrame/Dataset it turns into the equivalent of "select count(*) from ..." in SQL, which some data formats (e.g. Parquet) can answer without scanning the data. Caching a DataFrame/Dataset, on the other hand, does require all of it to be computed and cached.
Matei

> On Feb 18, 2017, at 2:16 AM, Sean Owen <so...@cloudera.com> wrote:
>
> I think the right answer is "don't do that" but if you really had to you
> could trigger a Dataset operation that does nothing per partition. I presume
> that would be more reliable because the whole partition has to be computed to
> make it available in practice. Or, go so far as to loop over every element.
>
> On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
> Especially during development, people often use .count() or
> .persist().count() to force evaluation of all rows — exposing any problems,
> e.g. due to bad data — and to load data into cache to speed up subsequent
> operations.
>
> But as the optimizer gets smarter, I’m guessing it will eventually learn that
> it doesn’t have to do all that work to give the correct count. (This blog post
> <https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html>
> suggests that something like this is already happening.) This will change
> Spark’s practical behavior while technically preserving semantics.
>
> What will people need to do then to force evaluation or caching?
>
> Nick
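
To make Sean's suggestion concrete, here is a minimal Scala sketch; the app
name and input path are invented for illustration, and any DataFrame or
Dataset works the same way. It drains each partition's iterator, which forces
every row to be computed without relying on count(), and then uses the same
trick to materialize a cache:

    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().appName("force-eval").getOrCreate()

    // Hypothetical input; per Matei's point above, count() on a Parquet-backed
    // DataFrame may be answered from metadata without scanning any rows, so it
    // is not a reliable way to force evaluation.
    val df = spark.read.parquet("/data/events.parquet")

    // Loop over every element, as Sean suggests. The typed val also sidesteps
    // the Scala/Java overload ambiguity of foreachPartition in some
    // Spark/Scala version combinations.
    val drain: Iterator[Row] => Unit = _.foreach(_ => ())

    // Forces full evaluation of every partition, surfacing any bad data.
    df.foreachPartition(drain)

    // persist() only marks the plan for caching; the subsequent full pass is
    // what actually loads the data into memory.
    df.persist()
    df.foreachPartition(drain)

Draining the iterator, rather than passing a function that ignores it, matters
because Spark's per-partition iterators are lazy: a true no-op could leave a
narrow pipeline unevaluated, whereas looping over every element cannot be
optimized away.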