dongjoon-hyun commented on pull request #29331: URL: https://github.com/apache/spark/pull/29331#issuecomment-670090509
Thank you, @Ngone51 . The user scenario looks like this. The job has a very long lineage. In a disaggregated cluster, executors die from time to time for various reasons (including maintenance and preemption), causing bad effects like `FetchFailedException` and frequent retries (not only of the direct parent stage, but of the ancestor stages, too). This is the same scenario you described.

So, the user tries to cut the lineage by using `cache` after the shuffle stage. But it turns out that `cache` can cause memory competition as a side effect. Although Spark can spill to disk, they don't want to load the data into memory in the first place, so they inevitably chose disk-only storage. In short, they are using DISK_ONLY and DISK_ONLY_2, and are currently asking for DISK_ONLY_3. The rationale for DISK_ONLY_3 is that they want the same replication concept as the existing HDFS service (which defaults to 3 replicas).
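For concreteness, here is a minimal Scala sketch of the pattern described above. The object name `DiskOnlyLineageCut` and the toy job are illustrative; `StorageLevel.DISK_ONLY` and `StorageLevel.DISK_ONLY_2` are existing Spark storage levels, and `DISK_ONLY_3` is what this PR adds:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Minimal sketch: cut a long lineage with a disk-only, replicated cache.
object DiskOnlyLineageCut {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DiskOnlyLineageCut").setMaster("local[2]"))

    // Stand-in for a job with a long lineage ending in a shuffle stage.
    val afterShuffle = sc.parallelize(1 to 1000000)
      .map(i => (i % 100, 1L))
      .reduceByKey(_ + _) // shuffle

    // Persist on disk only with 2 replicas: no memory competition, and a
    // single lost executor does not force recomputation of the ancestors.
    // DISK_ONLY_2 exists today; this PR adds DISK_ONLY_3 for HDFS-like 3x.
    afterShuffle.persist(StorageLevel.DISK_ONLY_2)
    afterShuffle.count() // materialize the replicated cache

    // Downstream actions read the cached blocks instead of re-running lineage.
    println(afterShuffle.filter(_._2 > 0).count())

    sc.stop()
  }
}
```

With replication 2 (or 3, once DISK_ONLY_3 lands), a surviving replica on another executor keeps a `FetchFailedException` from walking back up the lineage.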
