[ https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411903#comment-15411903 ]
Shubham Chopra commented on SPARK-15354:
----------------------------------------

This is part of the larger umbrella JIRA [SPARK-15352|https://issues.apache.org/jira/browse/SPARK-15352].

We envision using Spark to respond to online queries in near real time, with data cached in RDDs/DataFrames. If a failure loses blocks of data backing an RDD/DataFrame, regenerating that data at "query time" would significantly hurt query performance. The lineage ensures there is no data loss, but we also need to ensure query performance doesn't degrade. The cached data therefore needs to be fail-safe to some extent. This problem is very similar to HDFS replication, which supports higher read traffic and provides high availability.

When run on YARN, Spark executors are housed inside YARN containers, and there can be multiple containers on the same node. With random block placement, both replicas of a block can end up on the same node, in which case losing that node loses the block. An HDFS-like approach would be a very basic way of addressing this. Note that the hosts chosen this way would still be random. Because replication objectives differ across topologies, a more general approach would work better.

> Topology aware block replication strategies
> -------------------------------------------
>
>                 Key: SPARK-15354
>                 URL: https://issues.apache.org/jira/browse/SPARK-15354
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Mesos, Spark Core, YARN
>            Reporter: Shubham Chopra
>
> Implementations of strategies for resilient block replication for different
> resource managers that replicate the 3-replica strategy used by HDFS, where
> the first replica is on an executor, the second replica within the same rack
> as the executor and a third replica on a different rack.
> The implementation involves providing two pluggable classes, one running in
> the driver that provides topology information for every host at cluster start
> and the second prioritizing a list of peer BlockManagerIds.
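
To make the two pluggable pieces described in the issue concrete, here is a minimal sketch, written in Python for brevity rather than in Spark's Scala. It is an illustration under stated assumptions, not the Spark implementation: `Peer`, `file_based_topology`, and `prioritize` are hypothetical names, and `Peer` merely stands in for a BlockManagerId that carries rack information. The ordering rule (one peer on the same rack but a different host, then one peer on a different rack, then everything else at random) is one reading of the HDFS-like strategy in the description.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Peer:
    """Hypothetical stand-in for a BlockManagerId plus its rack."""
    executor_id: str
    host: str
    rack: Optional[str]  # None when the topology is unknown


def file_based_topology(mapping: Dict[str, str]) -> Callable[[str], Optional[str]]:
    """Hypothetical driver-side mapper from host name to rack (None if unknown)."""
    return lambda host: mapping.get(host)


def prioritize(source: Peer, peers: List[Peer], rng: random.Random) -> List[Peer]:
    """Order candidate peers for replicating a block held by `source`.

    Assumes `peers` does not contain `source` itself.
    """
    # Peers on the source's own node protect least against node loss,
    # so they are kept only as a last resort.
    other_hosts = [p for p in peers if p.host != source.host]
    same_host = [p for p in peers if p.host == source.host]
    rng.shuffle(other_hosts)
    rng.shuffle(same_host)

    chosen: List[Peer] = []
    if source.rack is not None:
        # One peer in the same rack (different host), then one in another rack.
        same_rack = [p for p in other_hosts if p.rack == source.rack]
        other_rack = [p for p in other_hosts
                      if p.rack is not None and p.rack != source.rack]
        chosen = same_rack[:1] + other_rack[:1]

    # Everything not chosen follows in random order.
    rest = [p for p in other_hosts if p not in chosen]
    return chosen + rest + same_host


# Example: two racks, with one peer sharing the source's node.
topology = file_based_topology({"h1": "r1", "h2": "r1", "h3": "r2", "h4": "r2"})
source = Peer("exec-1", "h1", topology("h1"))
peers = [
    Peer("exec-2", "h1", topology("h1")),
    Peer("exec-3", "h2", topology("h2")),
    Peer("exec-4", "h3", topology("h3")),
    Peer("exec-5", "h4", topology("h4")),
]
ordered = prioritize(source, peers, random.Random(42))
```

When the source's rack is unknown, the sketch falls back to random placement across other hosts, which matches the existing behaviour the comment describes while still avoiding the source's own node where possible.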