Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    Actually I think @mridulm has a point here - if we only retry all the tasks of the repartition/zip* stage, it's still possible that some tasks in a succeeding stage have already finished before the retry, and the outputs that should go to those tasks/partitions may be ignored, which causes data loss.
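
    A minimal sketch of the nondeterminism behind this (runnable locally; names are illustrative, not a reproduction of the PR's exact scenario): repartition() distributes rows round-robin, so which output partition a row lands in depends on the order the input arrives in, and a rerun of one upstream task can move rows across output partitions.

    ```scala
    import org.apache.spark.sql.SparkSession

    object RepartitionHazardSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[4]").appName("sketch").getOrCreate()
        val sc = spark.sparkContext

        // Simulate nondeterministic record order within a partition, as can
        // happen after a shuffle when map outputs are fetched in varying order.
        val data = sc.parallelize(1 to 100, 4)
          .mapPartitions(it => scala.util.Random.shuffle(it.toList).iterator)

        // Round-robin repartition: the row-to-partition assignment now depends
        // on the shuffled order above. If only some downstream tasks are
        // retried after a failure, rows can end up dropped or duplicated.
        data.repartition(8).glom().collect().zipWithIndex.foreach {
          case (rows, i) => println(s"partition $i: ${rows.sorted.mkString(",")}")
        }
        spark.stop()
      }
    }
    ```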
    
    If we go down the current path to fix the issue, we have to track all the (partially) finished succeeding stages of a stage, and when we retry a stage, we also retry all of its succeeding stages. One typical scenario is hitting an ExecutorLost, where you may have to retry a whole chain of finished stages, as in the sketch below.
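
    Purely illustrative pseudocode of that bookkeeping (these are not the DAGScheduler's actual types): when a stage is retried, walk its succeeding stages transitively and mark every one that has finished tasks for retry as well.

    ```scala
    object RollbackSketch {
      // Hypothetical, simplified stage graph - not Spark's internal Stage class.
      final case class Stage(id: Int, succeeding: Seq[Stage], hasFinishedTasks: Boolean)

      // Collect the retried stage plus every transitively succeeding stage that
      // has already (partially) finished and therefore holds stale output.
      def stagesToRetry(retried: Stage): Set[Int] = {
        def walk(s: Stage): Set[Int] =
          s.succeeding.flatMap { c =>
            walk(c) ++ (if (c.hasFinishedTasks) Set(c.id) else Set.empty[Int])
          }.toSet
        walk(retried) + retried.id
      }

      def main(args: Array[String]): Unit = {
        // stage 0 -> stage 1 -> stage 2, where stages 1 and 2 already finished
        val s2 = Stage(2, Nil, hasFinishedTasks = true)
        val s1 = Stage(1, Seq(s2), hasFinishedTasks = true)
        val s0 = Stage(0, Seq(s1), hasFinishedTasks = false)
        println(stagesToRetry(s0)) // stages 0, 1 and 2: the whole chain is retried
      }
    }
    ```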
    
    However, this approach can still be useful: if you never hit a FetchFailure in a job containing multiple repartition()/zip* actions, you don't pay the performance penalty (this is a major difference from the insert-a-local-sort approach). That said, I now tend to add an extra flag that lets users disable the approach (when they believe they are aware of the correctness issue and have a better way to handle it - such as manually checkpointing).
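
    To be concrete about what opting out would look like, a spark-shell style sketch (the config key below is hypothetical - this PR does not define it):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Opting out keeps the pre-PR behavior/performance; the user takes
    // responsibility for the correctness issue (e.g. by checkpointing).
    val spark = SparkSession.builder()
      .config("spark.shuffle.retryWholeStageOnFetchFailure", "false") // hypothetical key
      .getOrCreate()
    ```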
    
    Actually, you are not able to sort an RDD if its data type is not sortable, so you either retry the stage (and all the finished tasks of its succeeding stages), or users have to handle it manually. Given that, I still think this approach can be useful, especially for small jobs that are unlikely to hit a FetchFailure, or for complex jobs on small datasets where users care more about correctness than performance. Note that, with the extra flag, users can always disable this approach, and the behavior/performance shall be exactly the same as in previous versions.
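
    And a spark-shell style sketch of the manual checkpointing workaround mentioned above (paths illustrative): checkpointing materializes the repartitioned data to reliable storage, so a later retry reads the saved partitions instead of re-running the nondeterministic round-robin assignment.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path

    val repartitioned = sc.parallelize(1 to 100).repartition(8)
    repartitioned.checkpoint() // must be called before the first action on this RDD
    repartitioned.count()      // triggers computation and writes the checkpoint
    // Downstream stages that later fail and retry will recompute from the saved
    // checkpoint rather than from a fresh (and possibly different) repartition.
    ```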

