Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/158#discussion_r11078295
--- Diff: docs/scala-programming-guide.md ---
@@ -265,11 +265,24 @@ A complete list of actions is available in the [RDD API doc](api/core/index.html
## RDD Persistence
-One of the most important capabilities in Spark is *persisting* (or *caching*) a dataset in memory across operations. When you persist an RDD, each node stores any slices of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for building iterative algorithms with Spark and for interactive use from the interpreter.
-
-You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant -- if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
-
-In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`org.apache.spark.storage.StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is:
+One of the most important capabilities in Spark is *persisting* (or *caching*) a dataset in memory
+across operations. When you persist an RDD, each node stores any slices of it that it computes in
+memory and reuses them in other actions on that dataset (or datasets derived from it). This allows
+future actions to be much faster (often by more than 10x). Caching is a key tool for building
+iterative algorithms with Spark and for interactive use from the interpreter.
+
+You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
+it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
+if any partition of an RDD is lost, it will automatically be recomputed using the transformations
+that originally created it.
+
+In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to
+persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space),
+or even replicate it across nodes. These levels are chosen by passing a
+[`org.apache.spark.storage.StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel)
+object to `persist()`. The `cache()` method is a shorthand for using the default storage level,
+which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of
+available storage levels is:
<table class="table">
--- End diff --
@haoyuan should we add OFF_HEAP to this list? Otherwise users may not be able to easily discover this feature. I'd say it provides support for storing in Tachyon (link to the Tachyon page) and that it's an alpha feature.
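For context, the `persist()`/`cache()` behavior the hunk documents can be sketched as follows (a minimal, hypothetical example, not part of the patch; it assumes a local-mode `SparkContext` constructed here for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Hypothetical local-mode context, for illustration only.
val sc = new SparkContext("local", "persistence-sketch")

val squares = sc.parallelize(1 to 100).map(n => n * n)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
squares.cache()

// The first action computes the partitions and keeps them in memory...
val total = squares.reduce(_ + _)
// ...and later actions on the same RDD reuse the cached partitions.
val count = squares.count()

// A non-default storage level is passed explicitly to persist(), e.g.
// serialized in-memory storage that spills to disk under memory pressure.
val words = sc.parallelize(Seq("spark", "cache", "persist"))
words.persist(StorageLevel.MEMORY_AND_DISK_SER)

sc.stop()
```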