Especially during development, people often call .count() or
.persist().count() to force evaluation of all rows, both to surface any
problems (e.g. from bad data) and to load the data into cache so that
subsequent operations run faster.
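
For concreteness, the pattern I have in mind looks roughly like this (the
session setup, input path, and storage level are just illustrative, not
taken from any particular job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("force-eval-sketch").getOrCreate()
    // Illustrative input path.
    val df = spark.read.parquet("/path/to/data")

    // Mark the DataFrame for caching, then force a full pass over every row.
    // Any bad-data errors surface here, and later actions read from the cache.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()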

But as the optimizer gets smarter, I’m guessing it will eventually learn
that it doesn’t have to do all that work to give the correct count. (This
blog post
<https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html>
suggests that something like this is already happening.) This will change
Spark’s practical behavior while technically preserving semantics.

What will people need to do then to force evaluation or caching?

Nick
