GitHub user markhamstra commented on the issue:
https://github.com/apache/spark/pull/15213
This doesn't make sense to me. The DAGSchedulerEventProcessLoop runs on a
single thread and processes one event from its queue at a time.
When the first CompletionEvent is handled as the result of a fetch failure,
the failed stage is added to failedStages and a ResubmitFailedStages event is
queued. After handleTaskCompletion is done, the next event is taken from the
queue.
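
For illustration, here is a minimal sketch of that single-consumer
event-loop pattern (the real implementation is org.apache.spark.util.EventLoop,
which DAGSchedulerEventProcessLoop extends; the event types and handler
signature below are simplified placeholders, not Spark's actual API):

```scala
import java.util.concurrent.LinkedBlockingDeque

// Simplified stand-ins for DAGScheduler events, not the real Spark types.
sealed trait SchedulerEvent
case class CompletionEvent(description: String) extends SchedulerEvent
case object ResubmitFailedStages extends SchedulerEvent

class SingleThreadedEventLoop(handle: SchedulerEvent => Unit) {
  private val queue = new LinkedBlockingDeque[SchedulerEvent]()

  // A single thread dequeues and handles events strictly in arrival order,
  // so no two handlers ever run concurrently.
  private val thread = new Thread("event-loop") {
    override def run(): Unit =
      try {
        while (true) handle(queue.take()) // blocks until an event arrives
      } catch {
        case _: InterruptedException => // stop() interrupts the loop
      }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def stop(): Unit = thread.interrupt()
  def post(event: SchedulerEvent): Unit = queue.put(event)
}
```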
Because events are dequeued and handled sequentially, either the
ResubmitFailedStages event will be handled before the CompletionEvent for the
second fetch failure, or the CompletionEvent will be handled first.

If the ResubmitFailedStages event is handled first, failedStages will be
cleared in resubmitFailedStages, so nothing prevents the subsequent
CompletionEvent from queueing another ResubmitFailedStages event to handle the
additional fetch failures.

If instead the second CompletionEvent is handled before the
ResubmitFailedStages event, its stages are added to the still non-empty
failedStages, and there is no need to schedule another ResubmitFailedStages
event: the one queued by the first CompletionEvent is still on the queue, and
handling it will also cover the stages the second CompletionEvent just added.

In either ordering, all of the failedStages are handled, and there is no race
condition.
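
To make the two orderings concrete, here is a self-contained sketch of the
handler logic (the failedStages.isEmpty guard mirrors the check the real
DAGScheduler performs before scheduling a resubmit, which it actually does
after a short delay via messageScheduler; the immediate post, event shapes,
and stage IDs below are simplified placeholders):

```scala
import scala.collection.mutable

object ResubmitSketch {
  sealed trait Event
  case class CompletionEvent(failedStageId: Int) extends Event
  case object ResubmitFailedStages extends Event

  private val eventQueue   = mutable.Queue[Event]() // stands in for the loop's queue
  private val failedStages = mutable.Set[Int]()

  def post(event: Event): Unit = eventQueue.enqueue(event)

  // Called for a fetch failure. Only the first failure after failedStages
  // was drained needs to queue ResubmitFailedStages; if the set is already
  // non-empty, a ResubmitFailedStages event is still pending on the queue
  // and will pick this stage up too.
  def handleTaskCompletion(failedStageId: Int): Unit = {
    if (failedStages.isEmpty) post(ResubmitFailedStages)
    failedStages += failedStageId
  }

  // Drains everything accumulated so far, whichever ordering produced it.
  def resubmitFailedStages(): Unit = {
    val toResubmit = failedStages.toSeq.sorted
    failedStages.clear()
    toResubmit.foreach(id => println(s"resubmitting stage $id"))
  }

  def processQueue(): Unit =
    while (eventQueue.nonEmpty) eventQueue.dequeue() match {
      case CompletionEvent(id)  => handleTaskCompletion(id)
      case ResubmitFailedStages => resubmitFailedStages()
    }

  def main(args: Array[String]): Unit = {
    // Ordering 1: the pending ResubmitFailedStages is handled before the
    // second fetch failure arrives, so the second CompletionEvent sees an
    // empty failedStages and queues a ResubmitFailedStages of its own.
    post(CompletionEvent(1)); processQueue()
    post(CompletionEvent(2)); processQueue()

    // Ordering 2: the second CompletionEvent is queued before the pending
    // ResubmitFailedStages is handled; one resubmit covers both stages.
    post(CompletionEvent(3)); post(CompletionEvent(4)); processQueue()
  }
}
```

Running main resubmits all four stages: stages 1 and 2 each trigger their own
ResubmitFailedStages (the first ordering), while stages 3 and 4 are covered by
a single one (the second ordering). Either way, nothing is dropped.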