dongjoon-hyun commented on pull request #29331: URL: https://github.com/apache/spark/pull/29331#issuecomment-669297999
> I get the value of 3x replication for persistent data; this is in theory persistence for data that is already recreatable, right?

Right, @srowen.

> cached data? or am I totally forgetting where else this can be used?

Yes. This cuts the lineage and works as an HDFS replacement. Previously, this could be achieved by writing the RDD back to HDFS; in a disaggregated cluster, that is now difficult.

> If so this doesn't seem as necessary, and even DISK_ONLY_2 feels like overkill.

It's not overkill. HDFS replication is not only for reliability: 3x HDFS replication also improves read throughput 3x. Are you sure one executor can serve that traffic, @srowen?

> I suppose one argument we've made in the past is that the 2x replication is to make the cached data available as local data in more places, to improve locality.

That could be an argument, but improved locality is only a small fraction of the benefit. The real benefits are higher throughput and fewer FetchFailedExceptions. If we don't have HDFS, this is the only viable option.

> But would MEMORY_AND_DISK_3 then make sense?

MEMORY_AND_DISK_3 is not recommended here because it additionally assumes that all the data fits in memory, which turned out to have severe side effects when executor memory is insufficient. Why load the data into memory if Spark will spill it back to disk anyway? This PR aims to serve the data like HDFS, conceptually, inside Spark, to support an HDFS-service-free ecosystem.
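For context, a minimal sketch (not from the PR itself) of how a replicated, disk-only cache cuts the lineage. `DISK_ONLY_3` is the storage level this PR adds; the `events` DataFrame and the input path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("disk-only-replication").getOrCreate()

// A hypothetical expensive computation whose lineage we want to cut.
val events = spark.read.parquet("/data/events")   // hypothetical path
  .groupBy("userId")
  .count()

// Persist the result on executor disks with 3 replicas (the level added by
// this PR). Downstream stages fetch the cached blocks instead of recomputing
// the lineage, and the extra replicas spread read traffic across 3 executors.
events.persist(StorageLevel.DISK_ONLY_3)
events.count()   // materialize the replicated blocks
```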
