dongjoon-hyun edited a comment on pull request #29331:
URL: https://github.com/apache/spark/pull/29331#issuecomment-669297999


   > I get the value of 3x replication for persistent data; this is in theory 
persistence for data that is already recreateable right? 
   
   Right, @srowen.
   
   > cached data? or am I totally forgetting where else this can be used?
   
   Yes. This cuts the lineage and works as an HDFS replacement. Previously, this 
could be achieved by writing the RDD back to HDFS. In a disaggregated cluster, 
that is no longer easy.
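   
   For illustration, a minimal sketch of the two approaches (the input path, the HDFS 
path, and the app name are hypothetical; `StorageLevel.DISK_ONLY_3` is the level this 
PR introduces):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.storage.StorageLevel
   
   val spark = SparkSession.builder().appName("replicated-persist").getOrCreate()
   val sc = spark.sparkContext
   
   val rdd = sc.textFile("/data/input").map(_.toUpperCase)
   
   // Before: avoid recomputing the lineage by writing the RDD back to HDFS
   // and re-reading it -- this requires an HDFS service.
   rdd.saveAsObjectFile("hdfs:///tmp/intermediate")
   val reloaded = sc.objectFile[String]("hdfs:///tmp/intermediate")
   
   // With this PR: keep the replicated blocks inside Spark's block manager instead.
   val persisted = rdd.persist(StorageLevel.DISK_ONLY_3)
   persisted.count()   // materializes the 3x-replicated on-disk blocks
   ```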
   
   > If so this doesn't seem as necessary, and even DISK_ONLY_2 feels like 
overkill.
   
   It's not overkill. HDFS replication is not only for reliability; 3x HDFS 
replication also improves read throughput 3x. Are you sure a single executor can 
serve that traffic, @srowen?
   
   > I suppose one argument we've made in the past is that the 2x replication 
is to make the cached data available as local data in more places, to improve 
locality. That could be an argument.
   
   Improved locality is only a small part of it. The real benefits are higher read 
throughput and fewer `FetchFailedException`s. Without HDFS, this is the only viable 
option.
   
   > But would MEMORY_AND_DISK_3 then make sense?
   
   MEMORY_AND_DISK_3 is not recommended here because it adds another assumption: 
that all the data fits into memory. That turned out to have severe side effects 
when executor memory is insufficient. Why load the data into memory at all if Spark 
will spill it back to disk anyway?
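   
   For reference, a sketch of how the two levels differ in `StorageLevel`'s public 
factory flags (useDisk, useMemory, useOffHeap, deserialized, replication); the 
`memoryAndDisk3` value below is hypothetical, constructed only for comparison:
   
   ```scala
   import org.apache.spark.storage.StorageLevel
   
   // What this PR adds: three on-disk replicas, no memory tier at all.
   val diskOnly3 = StorageLevel(useDisk = true, useMemory = false,
     useOffHeap = false, deserialized = false, replication = 3)
   
   // A hypothetical MEMORY_AND_DISK_3 would also try to hold every block in
   // memory first and spill to disk only under pressure -- exactly the
   // assumption questioned above.
   val memoryAndDisk3 = StorageLevel(useDisk = true, useMemory = true,
     useOffHeap = false, deserialized = true, replication = 3)
   ```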
   
   This PR aims to serve the data, conceptually like HDFS, inside Spark, to support 
an HDFS-service-free ecosystem.

