Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17297
Thanks a lot @squito for taking a look at it and for your feedback.
>> this is already true. when there is a fetch failure, the TaskSetManager
is marked as zombie, and the DAGScheduler resubmits stages, but nothing
actively kills running tasks.
That is true but currently the DAG scheduler has no idea about which tasks
are running and which are being aborted. With this change, the task set manager
informs the dag scheduler about currently running/aborted tasks so that the DAG
scheduler can avoid resubmitting duplicates.
>> I don't think its true that it relaunches all tasks that hadn't
completed when the fetch failure occurred. it relaunches all the tasks haven't
completed, by the time the stage gets resubmitted. More tasks can complete in
between the time of the first failure, and the time the stage is resubmitted.
Yes that's true. I will update the PR description.
>> So I think in (b) and (c), you are trying to avoid resubmitting tasks
3-9 on stage 1 attempt 1. the thing is, there is a strong reason to believe
that the original version of those tasks will fail. Most likely, those tasks
needs map output from the same executor that caused the first fetch failure. So
Kay is suggesting that we take the opposite approach, and instead actively kill
the tasks from stage 1 attempt 0. OTOH, its possible that (i) the issue may
have been transient or (ii) the tasks already finished fetching that data
before the error occurred. We really have no idea.
In our case, we are observing that any transient issue on the shuffle
service might cause few tasks to fail. While other reducers might not see the
fetch failure because either they already fetched the data from that shuffle
service or they are yet to fetch it. Killing all the reducers in those cases
is waste of a lot of work and also as I mentioned above, we might end of in a
state where jobs will not make any progress at all in case of frequent fetch
failure, because they will just flip-flop between two stage.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]