Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/158#discussion_r11078328
--- Diff: docs/scala-programming-guide.md ---
@@ -307,30 +320,54 @@ In addition, each RDD can be stored using a different *storage level*, allowing
### Which Storage Level to Choose?
-Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency.
-We recommend going through the following process to select one:
+Spark's storage levels are meant to provide different trade-offs between memory usage and CPU
+efficiency. It allows users to choose memory, disk, or Tachyon for storing data. We recommend going
+through the following process to select one:
* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY`), leave them that way. This is the most
  CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library](tuning.html) to make the objects
-  much more space-efficient, but still reasonably fast to access.
+  much more space-efficient, but still reasonably fast to access. You can also use `Tachyon` mode
+  to store the data off the heap in [Tachyon](http://tachyon-project.org/). This will significantly
+  reduce JVM GC overhead.
* Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large
  amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
* Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web
  application). *All* the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones
  let you continue running tasks on the RDD without waiting to recompute a lost partition.
-
-If you want to define your own storage level (say, with replication factor of 3 instead of 2), then use the function factor method `apply()` of the [`StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel$) singleton object.
+
+If you want to define your own storage level (say, with replication factor of 3 instead of 2), then
+use the factory method `apply()` of the
+[`StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel$) singleton object.
+
+Spark has a block manager inside the Executors that lets you choose memory, disk, or Tachyon. The
+latter is for storing RDDs off-heap, outside the Executor JVM, on top of the memory management system
--- End diff --
It might make sense to say this when defining the `OFF_HEAP` level; then you can just refer to it here.
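For reference, a minimal sketch of how the two levels discussed in this hunk would be used. This assumes a live `SparkContext` named `sc`, and the `StorageLevel.apply(useDisk, useMemory, deserialized, replication)` factory argument order from this era of Spark, which may differ in later versions; it is not runnable outside a Spark deployment:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch only: assumes a running SparkContext `sc` is already in scope.
val cached = sc.parallelize(1 to 1000)

// A custom level via the StorageLevel factory method: in-memory,
// deserialized, replicated on 3 executors instead of the 2 that the
// built-in MEMORY_ONLY_2 level gives you.
cached.persist(StorageLevel(false, true, true, 3))

// The off-heap level this diff introduces, backed by Tachyon.
val offHeap = sc.parallelize(1 to 1000)
offHeap.persist(StorageLevel.OFF_HEAP)
```

Note that an RDD's storage level can only be set once, hence the two separate RDDs above.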