[ https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411903#comment-15411903 ]

Shubham Chopra commented on SPARK-15354:
----------------------------------------

This is part of the larger umbrella JIRA 
[SPARK-15352|https://issues.apache.org/jira/browse/SPARK-15352]. We envision 
using Spark to respond to online queries in near real-time, with data cached in 
RDDs/DataFrames. In case of failures, where blocks of data backing an 
RDD/DataFrame are lost, regenerating the data at query time would result in a 
significant hit to query performance. Lineage ensures there is no data loss, 
but we also need to ensure query performance doesn't degrade. 

The cached data, therefore, needs to be fail-safe to some extent. This problem 
is very similar to the one HDFS replication solves, where replication provides 
both higher read throughput and high availability. 

When run on YARN, Spark executors are housed inside YARN containers, and there 
can be multiple containers on the same node. With random block placement, if 
both replicas of a block end up on the same node, the data is susceptible to 
node loss. An HDFS-like approach would be a very basic way of addressing this, 
although the hosts chosen that way would still be random. Since the replication 
objectives to be met vary with the cluster topology, a more general, pluggable 
approach would work better.
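The HDFS-style ordering described above (second replica on a different host in the same rack, third replica off-rack) can be sketched as follows. This is purely illustrative; the names `Peer` and `chooseReplicaPeer` are hypothetical and not part of any Spark API.

```scala
// Hypothetical sketch of rack-aware placement for a second replica.
// Prefer a different host within the same rack, then fall back to a
// different rack, so two replicas never share a node.
case class Peer(host: String, rack: String)

def chooseReplicaPeer(source: Peer, candidates: Seq[Peer]): Option[Peer] = {
  // Never place the second replica on the same host as the first.
  val otherHosts = candidates.filterNot(_.host == source.host)
  val sameRack   = otherHosts.filter(_.rack == source.rack)
  val otherRack  = otherHosts.filterNot(_.rack == source.rack)
  // Same-rack first (cheaper copy), off-rack as a fallback.
  sameRack.headOption.orElse(otherRack.headOption)
}

val src   = Peer("node1", "rack1")
val peers = Seq(Peer("node1", "rack1"), Peer("node2", "rack1"), Peer("node3", "rack2"))
println(chooseReplicaPeer(src, peers)) // picks node2: same rack, different host
```

Note that this avoids the failure mode described above: because the source host is filtered out first, both replicas can never land on the same node even when candidate selection is otherwise random.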

> Topology aware block replication strategies
> -------------------------------------------
>
>                 Key: SPARK-15354
>                 URL: https://issues.apache.org/jira/browse/SPARK-15354
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Mesos, Spark Core, YARN
>            Reporter: Shubham Chopra
>
> Implementations of strategies for resilient block replication for different 
> resource managers that mirror the 3-replica strategy used by HDFS, where 
> the first replica is on an executor, the second replica is within the same rack 
> as the executor, and the third replica is on a different rack. 
> The implementation involves providing two pluggable classes: one running in 
> the driver that provides topology information for every host at cluster start, 
> and a second that prioritizes a list of peer BlockManagerIds. 
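The two pluggable classes described in the issue could look roughly like this. This is a sketch under assumptions: the trait names, method signatures, and the simplified `BlockManagerId` are illustrative, not the eventual Spark API.

```scala
// Sketch of the two pluggable pieces (hypothetical names and signatures).

// Runs in the driver: maps every host to its topology (e.g. rack) at cluster start.
trait TopologyMapper {
  def getTopologyForHost(host: String): Option[String]
}

// A trivial mapper backed by a static host -> rack table.
class StaticTopologyMapper(racks: Map[String, String]) extends TopologyMapper {
  def getTopologyForHost(host: String): Option[String] = racks.get(host)
}

// Simplified stand-in for Spark's BlockManagerId, for illustration only.
case class BlockManagerId(host: String, port: Int)

// Prioritizes candidate peers for replication: same-rack peers on other
// hosts first, then off-rack peers, mirroring the HDFS-style ordering.
class RackAwarePolicy(mapper: TopologyMapper) {
  def prioritize(self: BlockManagerId, peers: Seq[BlockManagerId]): Seq[BlockManagerId] = {
    val myRack = mapper.getTopologyForHost(self.host)
    val others = peers.filterNot(_.host == self.host)
    val (sameRack, offRack) =
      others.partition(p => mapper.getTopologyForHost(p.host) == myRack)
    sameRack ++ offRack
  }
}
```

Keeping topology resolution and peer prioritization in separate classes is what makes the design general: the same prioritization policy can be reused on YARN, Mesos, or standalone clusters by swapping in a resource-manager-specific `TopologyMapper`.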



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
