Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/158#discussion_r11078328
--- Diff: docs/scala-programming-guide.md ---
@@ -307,30 +320,54 @@ In addition, each RDD can be stored using a different *storage level*, allowing
### Which Storage Level to Choose?
-Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency.
-We recommend going through the following process to select one:
+Spark's storage levels are meant to provide different trade-offs between memory usage and CPU
+efficiency. It allows users to choose memory, disk, or Tachyon for storing data. We recommend going
+through the following process to select one:
* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY`), leave them that way. This is the most
  CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library](tuning.html) to make the objects
-  much more space-efficient, but still reasonably fast to access.
+  much more space-efficient, but still reasonably fast to access. You can also use `Tachyon` mode
+  to store the data off the heap in [Tachyon](http://tachyon-project.org/). This will significantly
+  reduce JVM GC overhead.
* Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large
  amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
* Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web
  application). *All* the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones
  let you continue running tasks on the RDD without waiting to recompute a lost partition.
-
-If you want to define your own storage level (say, with replication factor of 3 instead of 2), then use the function factor method `apply()` of the [`StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel$) singleton object.
+
+If you want to define your own storage level (say, with replication factor of 3 instead of 2), then
+use the factory method `apply()` of the
+[`StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel$) singleton object.
+
+Spark has a block manager inside the Executors that lets you choose memory, disk, or Tachyon. The
+latter is for storing RDDs off-heap, outside the Executor JVM, on top of the memory management system
--- End diff --
It might make sense to say this when defining the `OFF_HEAP` level; then you can just refer to it here.
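For reference, a minimal sketch of how the two levels discussed in this hunk would be used. This assumes a live `SparkContext` named `sc`, and the `StorageLevel.apply(useDisk, useMemory, deserialized, replication)` factory argument order from this era of Spark, which may differ in later versions; it is not runnable outside a Spark deployment:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch only: assumes a running SparkContext `sc` is already in scope.
val cached = sc.parallelize(1 to 1000)

// A custom level via the StorageLevel factory method: in-memory,
// deserialized, replicated on 3 executors instead of the 2 that the
// built-in MEMORY_ONLY_2 level gives you.
cached.persist(StorageLevel(false, true, true, 3))

// The off-heap level this diff introduces, backed by Tachyon.
val offHeap = sc.parallelize(1 to 1000)
offHeap.persist(StorageLevel.OFF_HEAP)
```

Note that an RDD's storage level can only be set once, hence the two separate RDDs above.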