Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/668#discussion_r12340137
--- Diff: docs/scala-programming-guide.md ---
@@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
 You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
 it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
 if any partition of an RDD is lost, it will automatically be recomputed using the transformations
-that originally created it.
+that originally created it. Note: in a multi-stage job, Spark saves the map output files from map
--- End diff --
It's a great idea to have this here. This is a totally non-obvious fact and
I think many users would like to know this.
My only thought is: would you mind moving this to the end of the "RDD
Persistence" section? Also, at this point in the guide I don't think the
concept of stages or jobs has been introduced, so it might be good to have
something like:
```
Spark sometimes automatically persists intermediate state from RDD
operations, even without users calling persist() or cache(). In particular, if
a shuffle happens when computing an RDD, Spark will keep the outputs from the
map side of the shuffle on disk to avoid re-computing the entire dependency
graph if an RDD is re-used. We still recommend that users call persist() if
they plan to re-use an RDD iteratively.
```
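To make the iterative re-use point concrete, here's a minimal Scala sketch (the app name, input path, and the `errors`/`keyword` names are placeholders I'm making up for illustration, not text from the guide):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Placeholder context setup; a guide snippet would normally assume `sc` exists.
val sc = new SparkContext(
  new SparkConf().setAppName("PersistSketch").setMaster("local[*]"))

// Placeholder input path.
val lines = sc.textFile("hdfs://...")

// This RDD is reused by several actions below, so persist it explicitly
// instead of relying on Spark's implicit caching of shuffle map outputs.
val errors = lines.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)

for (keyword <- Seq("timeout", "connection", "disk")) {
  // Each count() is a separate action; without persist() each one would
  // re-read and re-filter the entire input.
  println(keyword + ": " + errors.filter(_.contains(keyword)).count())
}
```

Note that `filter` here is a narrow transformation with no shuffle, so the implicit map-output caching described above wouldn't kick in at all; that's exactly the case where an explicit persist() matters.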