Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/668#discussion_r12340137
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -278,10 +278,13 @@ iterative algorithms with Spark and for interactive use from the interpreter.
     You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
     it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
     if any partition of an RDD is lost, it will automatically be recomputed using the transformations
    -that originally created it.
    +that originally created it. Note: in a multi-stage job, Spark saves the map output files from map
    --- End diff --
    
    It's a great idea to have this here. This is a totally non-obvious fact and I think many users would like to know this.
    
    My only thought is: would you mind moving this to the end of the "RDD Persistence" section? Also, at this point in the guide I don't think the concepts of stages or jobs have been introduced, so it might be good to have something like:
    
    ```
    Spark sometimes automatically persists intermediate state from RDD operations, even without users calling persist() or cache(). In particular, if a shuffle happens when computing an RDD, Spark will keep the outputs from the map side of the shuffle on disk to avoid re-computing the entire dependency graph if an RDD is re-used. We still recommend users call persist() if they plan to re-use an RDD iteratively.
    ```
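
    To make the suggested note concrete, here is a minimal Scala sketch of the recommended pattern: explicitly persisting a shuffled RDD that will be re-used across several actions. The input path and the word-count logic are hypothetical, just for illustration:
    
    ```
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel
    
    object PersistExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PersistExample"))
    
        // reduceByKey triggers a shuffle; Spark automatically keeps the map-side
        // shuffle outputs on disk, but only an explicit persist()/cache() keeps
        // the reduced result itself available for fast re-use.
        val counts = sc.textFile("hdfs://...")      // hypothetical input path
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .persist(StorageLevel.MEMORY_ONLY)        // equivalent to cache()
    
        // Re-using `counts` in repeated actions now reads the cached partitions
        // instead of re-running the whole lineage each time.
        for (_ <- 1 to 3) {
          println(counts.count())
        }
    
        sc.stop()
      }
    }
    ```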

