[
https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001381#comment-14001381
]
Mridul Muralidharan commented on SPARK-1855:
--------------------------------------------
Currently, the StorageLevel constructor is private to the StorageLevel object,
if I am not mistaken:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala
So it is limited to 2-way replication at best. We could definitely add more
predefined levels to the object for 3x or more.
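To illustrate the idea, here is a minimal sketch in plain Scala. The class and
object names are hypothetical (this is not Spark's actual StorageLevel, whose
constructor is private as noted above); it only shows what carrying an
arbitrary replication count alongside the existing flags would look like:

```scala
// Hypothetical stand-in for StorageLevel: same flags, but with
// replication exposed as an arbitrary count rather than a fixed
// set of predefined 1x/2x constants.
case class SimpleStorageLevel(
    useDisk: Boolean,
    useMemory: Boolean,
    deserialized: Boolean,
    replication: Int) {
  require(replication >= 1, "replication must be at least 1")
}

object SimpleStorageLevel {
  // Mirrors the existing MEMORY_ONLY_2 style, plus a 3x variant
  // that the current object does not predefine.
  val MEMORY_ONLY_2 = SimpleStorageLevel(false, true, true, 2)
  val MEMORY_ONLY_3 = SimpleStorageLevel(false, true, true, 3)
}
```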
There are a couple of things to note there, though:
a) As I mentioned before, in our experience the cost of replication became very
high.
To give context, we tried a replicated storage level so that we could increase
the number of process-local nodes; performance degraded to well below the
non-replicated case (which had the added cost of remote fetch + serde for
non-process-local tasks).
b) Current replication is very simple: pick some node and push the data there.
There are a few issues with this:
1) We could pick another executor on the same node.
2) We could pick executors that are all on the same rack.
So a node or rack failure would mean the block is lost entirely. For
comparison, see how the HDFS NameNode (NN) does block placement.
Hence addressing this would also require changing how we do replication;
otherwise we would be introducing the possibility of data loss.
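For illustration, a rough sketch of NameNode-style rack-aware placement, which
avoids both failure modes above (co-located executors and single-rack
placement). All names here are hypothetical; this is neither Spark's nor
HDFS's actual API:

```scala
// Each executor is identified by its host and the rack it sits on.
case class Executor(host: String, rack: String)

// NameNode-style default policy: 1st replica on the writer's node,
// 2nd replica on a different rack, 3rd replica on the 2nd replica's
// rack but on a different node.
def placeReplicas(
    writer: Executor,
    all: Seq[Executor],
    replication: Int): Seq[Executor] = {
  val first = writer
  // 2nd replica: any executor off the writer's rack.
  val offRack = all.find(e => e.rack != first.rack)
  // 3rd replica: same rack as the 2nd, but a different host.
  val thirdOnSecondRack = offRack.flatMap(second =>
    all.find(e => e.rack == second.rack && e.host != second.host))
  (Seq(first) ++ offRack ++ thirdOnSecondRack).distinct.take(replication)
}
```

With this shape of policy, a single node or rack failure still leaves at least
one surviving replica when replication >= 2, which is the property the simple
"pick some node" scheme does not guarantee.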
Also note that if we are going down this path, there are two things to keep in
mind:
1) Take a more serious look at Tachyon for replicated block storage (not just
for offloading process-local storage). I am not sure if it currently provides
this, or can be enhanced to provide it.
2) Instead of hardcoding the placement strategy and the replication count, we
should make them configurable (if we are doing it ourselves).
> Provide memory-and-local-disk RDD checkpointing
> -----------------------------------------------
>
> Key: SPARK-1855
> URL: https://issues.apache.org/jira/browse/SPARK-1855
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, Spark Core
> Affects Versions: 1.0.0
> Reporter: Xiangrui Meng
>
> Checkpointing is used to cut long lineage while maintaining fault tolerance.
> The current implementation is HDFS-based. Using the BlockRDD we can create
> in-memory-and-local-disk (with replication) checkpoints that are not as
> reliable as the HDFS-based solution, but faster.
> It can help applications that require many iterations.
--
This message was sent by Atlassian JIRA
(v6.2#6252)