dongjoon-hyun edited a comment on pull request #29331:
URL: https://github.com/apache/spark/pull/29331#issuecomment-670090509


   Thank you, @Ngone51 . The user scenario looks like this. The job has a very
long lineage. In a disaggregated cluster, executors sometimes die for various
reasons (including maintenance and preemption), causing bad effects like
`FetchFailedException` and frequent retries (not only of the direct parent
stage, but of its ancestors, too). This is the same as you wrote. So, the user
tries to cut the lineage by using `cache` after the shuffle stage. But it
turns out that `cache` can cause memory competition as a side effect. Although
Spark can spill to disk, they don't want to load the data into memory in the
first place, so they inevitably decided to choose disk-only storage. In short,
they are using `DISK_ONLY` (replication 1) and `DISK_ONLY_2`, and are
currently asking for `DISK_ONLY_3`. The choice depends on their decision for
each individual dataset.
   
   The rationale for `DISK_ONLY_3` is that they want the same replication
concept as the existing HDFS service, whose default replication factor is 3.
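
   For illustration, here is a minimal Scala sketch of the pattern described
above: persisting the output of a shuffle stage with a disk-only, replicated
storage level to cut the lineage. `DISK_ONLY` and `DISK_ONLY_2` are existing
named levels; the replication-3 level built via the generic
`StorageLevel.apply` factory approximates the `DISK_ONLY_3` constant this PR
would add as a name. The `reduceByKey` pipeline is a made-up stand-in for the
user's long-lineage job, not their actual workload.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.storage.StorageLevel

   object LineageCutExample {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("lineage-cut").getOrCreate()
       val sc = spark.sparkContext

       // A shuffle stage at the end of a (potentially very long) lineage.
       val shuffled = sc.parallelize(0 until 1000000, 200)
         .map(i => (i % 1000, 1L))
         .reduceByKey(_ + _)

       // Disk-only with replication = 3, built via the generic factory.
       // This PR would expose the same level as StorageLevel.DISK_ONLY_3.
       val diskOnly3 = StorageLevel(
         useDisk = true, useMemory = false, useOffHeap = false,
         deserialized = false, replication = 3)

       // Persisting here cuts the lineage: if an executor is lost later,
       // a surviving replica avoids recomputing all ancestor stages.
       shuffled.persist(diskOnly3)
       shuffled.count() // materialize the replicated blocks

       spark.stop()
     }
   }
   ```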

