Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/21698
Actually I think @mridulm has a point here - if we only retry all the
tasks of the repartition/zip* stage, it's still possible that some tasks in a
succeeding stage have already finished before the retry, and outputs that
should go to those tasks/partitions may be ignored, causing data loss.
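To make the failure mode concrete, here is a minimal sketch (spark-shell style, assuming an existing `sc`; the numbers are illustrative, not from the PR):

```scala
// repartition() distributes records round-robin over the output partitions.
// When its input is itself a shuffle output, the iteration order of the
// input records can depend on the order in which shuffle blocks happen to
// be fetched, so a map task rerun after a FetchFailure may send the same
// record to a *different* output partition than the first attempt did.
val shuffled      = sc.parallelize(1 to 1000, 10).map(x => (x % 10, x)).groupByKey()
val repartitioned = shuffled.repartition(5)
// If some downstream tasks already finished against the first attempt's
// output, records re-routed elsewhere on retry are effectively lost.
```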
If we go down the current path to fix the issue, we have to track all
the (partially) finished succeeding stages of a stage, and when we retry a
stage, we also have to retry all of its succeeding stages (see the sketch
below). One typical scenario is hitting an ExecutorLost, where you may have
to retry a whole chain of already-finished stages.
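Just to illustrate the bookkeeping (a toy model, not the actual DAGScheduler code):

```scala
// Given a map of stage -> direct child stages, collect every (possibly
// already finished) succeeding stage that must be retried together with
// a failed stage.
def stagesToRollBack(failed: Int, children: Map[Int, Seq[Int]]): Set[Int] = {
  def visit(stage: Int, acc: Set[Int]): Set[Int] =
    children.getOrElse(stage, Nil).foldLeft(acc + stage) {
      (s, child) => if (s(child)) s else visit(child, s)
    }
  visit(failed, Set.empty)
}

// Example: stage 1 feeds stages 2 and 3, and stage 3 feeds stage 4.
// Retrying stage 1 must also retry 2, 3 and 4, even if they had finished.
stagesToRollBack(1, Map(1 -> Seq(2, 3), 3 -> Seq(4)))  // Set(1, 2, 3, 4)
```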
However, this approach can still be useful: if a job containing multiple
repartition()/zip* actions never hits a FetchFailure, it pays no performance
penalty (a major difference from the insert-local-sort approach). That said,
I'm now inclined to add an extra flag that lets users disable the approach,
for cases where they are aware of the correctness issue and have a better
way to handle it - such as manually checkpointing (see the sketch below).
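For reference, the manual checkpointing workaround I mean looks roughly like this (the checkpoint directory is just an example path):

```scala
// Checkpointing truncates the lineage: a retried downstream stage re-reads
// the materialized checkpoint instead of recomputing the non-deterministic
// repartition output, so the partition contents stay stable across retries.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // example path

val repartitioned = sc.parallelize(1 to 1000, 10).repartition(5)
repartitioned.checkpoint()  // mark for reliable materialization
repartitioned.count()       // run an action so the checkpoint is actually written
```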
Also, you are not able to sort an RDD if its data type is not sortable,
so you either retry the stage (and all the finished tasks of the succeeding
stages), or users have to handle it manually. Given that, I still think
this approach can be useful, especially for small jobs that are unlikely to
hit a FetchFailure, or for complex jobs on small datasets where users care
about correctness more than performance. Note that with the extra flag,
users can always disable this approach, and the behavior/performance will be
exactly the same as in previous versions.
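To expand on the "not sortable" point - the insert-local-sort approach needs an `Ordering` for the element type, roughly like this (`localSort` is a hypothetical helper, not code from the PR):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Works for element types with an implicit Ordering (Int, String, tuples of
// such, ...): sorting each partition locally makes the record order - and
// hence repartition's round-robin assignment - deterministic across retries.
def localSort[T: Ordering: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions(_.toArray.sorted.iterator, preservesPartitioning = true)

// Does not compile for an arbitrary type without an Ordering, which is
// exactly the case where only retrying (or manual handling) remains:
//   case class Event(payload: Array[Byte])
//   localSort(sc.parallelize(Seq(Event(Array[Byte]()))))  // no Ordering[Event]
```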