Repository: spark
Updated Branches:
  refs/heads/branch-1.0 514ee93da -> 51e277557


Proposal: clarify Scala programming guide on caching ...

... with regard to saved map output. Wording taken partially from Matei 
Zaharia's email to the Spark user list.
http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

Author: Ethan Jewett <[email protected]>

Closes #668 from esjewett/Doc-update and squashes the following commits:

11793ce [Ethan Jewett] Update based on feedback
171e670 [Ethan Jewett] Clarify Scala programming guide on caching ...
(cherry picked from commit 48ba3b8cdc3bdc7c67bc465d1f047fa3f44d7085)

Signed-off-by: Patrick Wendell <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/51e27755
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/51e27755
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/51e27755

Branch: refs/heads/branch-1.0
Commit: 51e27755750e896ac632f4a40b362bd580e21ced
Parents: 514ee93
Author: Ethan Jewett <[email protected]>
Authored: Tue May 6 20:50:08 2014 -0700
Committer: Patrick Wendell <[email protected]>
Committed: Tue May 6 20:50:18 2014 -0700

----------------------------------------------------------------------
 docs/scala-programming-guide.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/51e27755/docs/scala-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index e7ceaa2..f25e9cc 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -145,7 +145,7 @@ RDDs support two types of operations: *transformations*, 
which create a new data
 
 All transformations in Spark are <i>lazy</i>, in that they do not compute 
their results right away. Instead, they just remember the transformations 
applied to some base dataset (e.g. a file). The transformations are only 
computed when an action requires a result to be returned to the driver program. 
This design enables Spark to run more efficiently -- for example, we can 
realize that a dataset created through `map` will be used in a `reduce` and 
return only the result of the `reduce` to the driver, rather than the larger 
mapped dataset.
 
-By default, each transformed RDD is recomputed each time you run an action on 
it. However, you may also *persist* an RDD in memory using the `persist` (or 
`cache`) method, in which case Spark will keep the elements around on the 
cluster for much faster access the next time you query it. There is also 
support for persisting datasets on disk, or replicated across the cluster. The 
next section in this document describes these options.
+By default, each transformed RDD may be recomputed each time you run an action 
on it. However, you may also *persist* an RDD in memory using the `persist` (or 
`cache`) method, in which case Spark will keep the elements around on the 
cluster for much faster access the next time you query it. There is also 
support for persisting datasets on disk, or replicated across the cluster. The 
next section in this document describes these options.
 
 The following tables list the transformations and actions currently supported 
(see also the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD) for 
details):
 
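(Aside, not part of the patch: a minimal Scala sketch of the behavior this hunk 
describes -- lazy transformations plus `persist()`/`cache()`. The context setup, 
input path, and variable names are illustrative assumptions.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative setup; any existing SparkContext behaves the same way.
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    val lines   = sc.textFile("data.txt")  // lazy: nothing is read yet
    val lengths = lines.map(_.length)      // lazy: only the lineage (a map over lines) is recorded

    lengths.persist()                      // or lengths.cache(); marks the RDD to be kept in memory once computed

    val total = lengths.reduce(_ + _)      // action: computes the lineage, fills the cache, returns one Int to the driver
    val max   = lengths.reduce(_ max _)    // second action reuses the cached partitions instead of re-reading the file
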
@@ -279,8 +279,8 @@ it is computed in an action, it will be kept in memory on 
the nodes. The cache i
 if any partition of an RDD is lost, it will automatically be recomputed using 
the transformations
 that originally created it.
 
-In addition, each RDD can be stored using a different *storage level*, 
allowing you, for example, to
-persist the dataset on disk, or persist it in memory but as serialized Java 
objects (to save space),
+In addition, each persisted RDD can be stored using a different *storage 
level*, allowing you, for example,
+to persist the dataset on disk, or persist it in memory but as serialized Java 
objects (to save space),
 or replicate it across nodes, or store the data in off-heap memory in 
[Tachyon](http://tachyon-project.org/).
 These levels are chosen by passing a
 
[`org.apache.spark.storage.StorageLevel`](api/scala/index.html#org.apache.spark.storage.StorageLevel)
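(Aside, not part of the patch: choosing a non-default storage level looks like 
the sketch below, continuing the snippet above. The specific level is only an 
example; a level is assigned once per RDD, before its first persisted 
computation.)

    import org.apache.spark.storage.StorageLevel

    val words = sc.textFile("data.txt").flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk when it does not fit
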
@@ -330,6 +330,8 @@ available storage levels is:
 </tr>
 </table>
 
+Spark sometimes automatically persists intermediate state from RDD operations, 
even without users calling persist() or cache(). In particular, if a shuffle 
happens when computing an RDD, Spark will keep the outputs from the map side of 
the shuffle on disk to avoid re-computing the entire dependency graph if an RDD 
is re-used. We still recommend users call persist() if they plan to re-use an 
RDD iteratively.
+
 ### Which Storage Level to Choose?
 
 Spark's storage levels are meant to provide different trade-offs between 
memory usage and CPU

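(Aside, not part of the patch: the paragraph added by the final hunk, sketched 
with an assumed word-count lineage; names and input are illustrative.)

    import org.apache.spark.SparkContext._   // brings in pair-RDD operations such as reduceByKey

    val pairs  = sc.textFile("data.txt").flatMap(_.split(" ")).map(w => (w, 1))
    val counts = pairs.reduceByKey(_ + _)    // computing counts requires a shuffle

    counts.count()     // first action: runs the shuffle; map-side outputs are written to disk
    counts.collect()   // reuses the saved map outputs rather than recomputing the whole lineage

    counts.persist()   // still recommended when the RDD will be re-used iteratively
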