Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/158#discussion_r11078295
  
    --- Diff: docs/scala-programming-guide.md ---
    @@ -265,11 +265,24 @@ A complete list of actions is available in the [RDD API doc](api/core/index.html
     
     ## RDD Persistence
     
    -One of the most important capabilities in Spark is *persisting* (or *caching*) a dataset in memory across operations. When you persist an RDD, each node stores any slices of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for building iterative algorithms with Spark and for interactive use from the interpreter.
    -
    -You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant -- if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
    -
    -In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`org.apache.spark.storage.StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is:
    +One of the most important capabilities in Spark is *persisting* (or *caching*) a dataset in memory
    +across operations. When you persist an RDD, each node stores any slices of it that it computes in
    +memory and reuses them in other actions on that dataset (or datasets derived from it). This allows
    +future actions to be much faster (often by more than 10x). Caching is a key tool for building
    +iterative algorithms with Spark and for interactive use from the interpreter.
    +
    +You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
    +it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant --
    +if any partition of an RDD is lost, it will automatically be recomputed using the transformations
    +that originally created it.
    +
    +In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to
    +persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space),
    +or even replicate it across nodes. These levels are chosen by passing a
    +[`org.apache.spark.storage.StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel)
    +object to `persist()`. The `cache()` method is a shorthand for using the default storage level,
    +which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of
    +available storage levels is:
     
     <table class="table">
    --- End diff --
    
    @haoyuan should we add OFF_HEAP to this list? Otherwise users may not be able to easily discover this feature. I'd say it provides support for storing in Tachyon (link to the Tachyon page) and that it's an alpha feature.
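    For reference, a minimal sketch of how the levels in this section are used from a spark-shell session, assuming `sc` is a live SparkContext and a `data.txt` input file (both hypothetical here); the `StorageLevel.OFF_HEAP` constant is the one under discussion:
    
    ```scala
    import org.apache.spark.storage.StorageLevel
    
    // Assumes `sc` is an existing SparkContext (e.g. provided by spark-shell).
    val lines = sc.textFile("data.txt")
    
    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    // deserialized objects kept in memory.
    lines.cache()
    
    // An RDD's storage level can only be assigned once, so separate RDDs
    // are used to illustrate the other levels.
    val words = lines.flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized in memory, saves space
    
    // The level this comment proposes documenting: OFF_HEAP stores
    // serialized blocks in Tachyon rather than on the JVM heap
    // (an alpha feature at this point).
    val pairs = words.map(w => (w, 1))
    pairs.persist(StorageLevel.OFF_HEAP)
    ```
    
    The first action computed over each RDD materializes it at its assigned level; subsequent actions reuse the cached partitions.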

