[
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775188#comment-16775188
]
Parth Gandhi commented on SPARK-25250:
--------------------------------------
[~Ngone51] I understand that you had a proposal and we were actively discussing
various solutions in the PR #22806. However, I have been working on that PR
tirelessly for a few months and we still have an ongoing discussion there. Any
specific reason why you created your own PR for the same issue? WDYT [~irashid]
[~cloud_fan] ?
> Race condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 2.3.1
> Reporter: Parth Gandhi
> Priority: Major
>
> We recently hit a race condition where a task from the previous stage attempt
> finished just before a new attempt for the same stage was created due to a
> fetch failure. The task created in the second attempt for the same partition
> id then retried multiple times with a TaskCommitDenied exception, without
> realizing that the task in the earlier attempt had already succeeded.
> For example, consider a task with partition id 9000 and index 9000 running in
> stage 4.0. A fetch failure occurs, so we spawn a new stage attempt 4.1. Within
> this window, the above task completes successfully, marking partition id 9000
> as complete for 4.0. However, since stage 4.1 has not yet been created, the
> task set info for that stage is not available to the TaskScheduler, so,
> naturally, partition id 9000 has not been marked completed for 4.1. Stage 4.1
> now spawns a task with index 2000 on the same partition id 9000. This task
> fails with a CommitDeniedException and, since it does not see the
> corresponding partition id as having been marked successful, it keeps retrying
> until the job finally succeeds. It does not cause any job failures because the
> DAG scheduler tracks the partitions separately from the task set managers.
>
> Steps to Reproduce:
> # Run any large job involving a shuffle operation.
> # When the ShuffleMap stage finishes and the ResultStage begins running,
> cause this stage to throw a fetch failure exception (e.g. by deleting certain
> shuffle files on any host).
> # Observe the task attempt numbers for the next stage attempt. Note that this
> issue is intermittent, so it might not happen every time.
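The race described in the issue can be sketched with a small model. This is an
illustrative simulation only, not Spark's actual internals: the class names
(OutputCommitCoordinator, TaskSetManager) echo Spark components, but the
methods and fields here are simplified assumptions.

```python
# Minimal model of the race, assuming (hypothetically) that commit rights for
# a partition are granted to exactly one task attempt, first come first served.

class OutputCommitCoordinator:
    """Grants the commit right for a partition to a single task attempt."""
    def __init__(self):
        # partition id -> (stage attempt, task attempt) holding the commit right
        self.committers = {}

    def can_commit(self, partition, stage_attempt, task_attempt):
        holder = self.committers.setdefault(
            partition, (stage_attempt, task_attempt))
        return holder == (stage_attempt, task_attempt)

class TaskSetManager:
    """Tracks completed partitions for one stage attempt only (a stale view
    relative to what the DAG scheduler knows globally)."""
    def __init__(self, stage_attempt):
        self.stage_attempt = stage_attempt
        self.completed = set()

coordinator = OutputCommitCoordinator()

# Stage attempt 4.0: the task for partition 9000 wins the commit right
# just before the new attempt is created.
assert coordinator.can_commit(9000, stage_attempt=0, task_attempt=0)

# Fetch failure spawns attempt 4.1. Its fresh TaskSetManager does not know
# partition 9000 just finished in 4.0, so it schedules that partition again.
tsm_41 = TaskSetManager(stage_attempt=1)
assert 9000 not in tsm_41.completed  # the stale view that causes the retries

# Every retry of 4.1's task for partition 9000 is denied the commit,
# because 4.0's task already holds it (the TaskCommitDenied loop).
retries = [coordinator.can_commit(9000, stage_attempt=1, task_attempt=t)
           for t in range(3)]
print("attempt 4.1 commit results across retries:", retries)
```

The retries all come back denied, while nothing fails the job itself, matching
the observation that the DAG scheduler's partition tracking is separate from
the per-attempt task set managers.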
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]