[
https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001381#comment-14001381
]
Mridul Muralidharan commented on SPARK-1855:
--------------------------------------------
Currently, the StorageLevel constructor is private to the StorageLevel object,
if I am not mistaken:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala
So it is limited to 2-way replication at best. We could definitely add more
predefined levels to the object for 3x or more.
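To illustrate the idea, here is a minimal sketch in plain Scala. The class and
object names are hypothetical (this is not Spark's actual StorageLevel, whose
constructor is private as noted above); it only shows what carrying an
arbitrary replication count alongside the existing flags would look like:

```scala
// Hypothetical stand-in for StorageLevel: same flags, but with
// replication exposed as an arbitrary count rather than a fixed
// set of predefined 1x/2x constants.
case class SimpleStorageLevel(
    useDisk: Boolean,
    useMemory: Boolean,
    deserialized: Boolean,
    replication: Int) {
  require(replication >= 1, "replication must be at least 1")
}

object SimpleStorageLevel {
  // Mirrors the existing MEMORY_ONLY_2 style, plus a 3x variant
  // that the current object does not predefine.
  val MEMORY_ONLY_2 = SimpleStorageLevel(false, true, true, 2)
  val MEMORY_ONLY_3 = SimpleStorageLevel(false, true, true, 3)
}
```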
There are a couple of things to note there, though:
a) As I mentioned before, in our experience the cost of replication became very
high.
To give context, we tried a replicated storage level so that we could increase
the number of process-local nodes; performance degraded to well below the
non-replicated case (which had the added cost of remote fetch + serde for
non-process-local tasks).
b) Current replication is very simple: pick some node and push the data there.
There are a few issues with this:
1) We could pick another executor on the same node.
2) We could pick executors that are all on the same rack.
So a node or rack failure would mean the block is lost entirely. For
comparison, see how the HDFS NameNode (NN) does block placement.
Hence addressing this would also require changing how we do replication;
otherwise we would be introducing the possibility of data loss.
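For illustration, a rough sketch of NameNode-style rack-aware placement, which
avoids both failure modes above (co-located executors and single-rack
placement). All names here are hypothetical; this is neither Spark's nor
HDFS's actual API:

```scala
// Each executor is identified by its host and the rack it sits on.
case class Executor(host: String, rack: String)

// NameNode-style default policy: 1st replica on the writer's node,
// 2nd replica on a different rack, 3rd replica on the 2nd replica's
// rack but on a different node.
def placeReplicas(
    writer: Executor,
    all: Seq[Executor],
    replication: Int): Seq[Executor] = {
  val first = writer
  // 2nd replica: any executor off the writer's rack.
  val offRack = all.find(e => e.rack != first.rack)
  // 3rd replica: same rack as the 2nd, but a different host.
  val thirdOnSecondRack = offRack.flatMap(second =>
    all.find(e => e.rack == second.rack && e.host != second.host))
  (Seq(first) ++ offRack ++ thirdOnSecondRack).distinct.take(replication)
}
```

With this shape of policy, a single node or rack failure still leaves at least
one surviving replica when replication >= 2, which is the property the simple
"pick some node" scheme does not guarantee.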
Also note that if we are going down this path, there are two things to keep in
mind:
1) Take a more serious look at Tachyon for replicated block storage (not just
for offloading process-local storage). I am not sure if it currently provides
this, or can be enhanced to provide it.
2) Instead of hardcoding the placement strategy and the replication count, we
should make them configurable (if we are doing it ourselves).
> Provide memory-and-local-disk RDD checkpointing
> -----------------------------------------------
>
> Key: SPARK-1855
> URL: https://issues.apache.org/jira/browse/SPARK-1855
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, Spark Core
> Affects Versions: 1.0.0
> Reporter: Xiangrui Meng
>
> Checkpointing is used to cut long lineage while maintaining fault tolerance.
> The current implementation is HDFS-based. Using the BlockRDD we can create
> in-memory-and-local-disk (with replication) checkpoints that are not as
> reliable as the HDFS-based solution, but faster.
> It can help applications that require many iterations.
--
This message was sent by Atlassian JIRA
(v6.2#6252)