[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-25250:
------------------------------------

    Assignee: Apache Spark

> Race condition with tasks running when new attempt for same stage is created leads to the task in the next attempt running on the same partition id retrying multiple times
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25250
>                 URL: https://issues.apache.org/jira/browse/SPARK-25250
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Parth Gandhi
>            Assignee: Apache Spark
>            Priority: Major
>
> We recently hit a race condition in which a task from a previous stage attempt finished just before a new attempt for the same stage was created due to a fetch failure. The task created in the second attempt for the same partition id then retried multiple times with a TaskCommitDenied exception, without recognizing that the task in the earlier attempt had already succeeded.
> For example, consider a task with partition id 9000 and index 9000 running in stage attempt 4.0. A fetch failure occurs, so a new stage attempt 4.1 is spawned. Within that window, the above task completes successfully, marking partition id 9000 as complete for 4.0. However, because stage attempt 4.1 has not yet been created, the taskset info for that stage is not available to the TaskScheduler, so partition id 9000 is not marked completed for 4.1. Stage 4.1 then spawns a task with index 2000 for the same partition id 9000. This task fails with a CommitDeniedException and, since it does not see the corresponding partition id marked as successful, it keeps retrying until the job finally succeeds. It does not cause any job failures because the DAG scheduler tracks the partitions separately from the task set managers. (A minimal sketch of this bookkeeping gap follows the steps below.)
>
> Steps to Reproduce:
> # Run any large job involving a shuffle operation.
> # When the ShuffleMap stage finishes and the ResultStage begins running, cause this stage to throw a fetch failure exception (for example, by deleting some shuffle files on any host).
> # Observe the task attempt numbers for the next stage attempt. Note that this issue is intermittent, so it may not reproduce every time.
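To make the per-attempt bookkeeping gap concrete, here is a minimal standalone Scala sketch. This is not the actual Spark scheduler code; StageAttempt, markCompleted, and shouldLaunch are invented stand-ins for the TaskSetManager-side state, used only to illustrate why a completion recorded against attempt 4.0 is invisible to attempt 4.1 unless it is explicitly propagated.

{code:scala}
import scala.collection.mutable

// Hypothetical stand-in for a TaskSetManager: one instance per stage attempt,
// each with its own view of which partitions have completed.
class StageAttempt(val stageId: Int, val attempt: Int) {
  val completedPartitions = mutable.Set[Int]()

  def markCompleted(partitionId: Int): Unit = completedPartitions += partitionId

  // Without cross-attempt bookkeeping, the new attempt re-launches the task.
  def shouldLaunch(partitionId: Int): Boolean =
    !completedPartitions.contains(partitionId)
}

object RaceSketch extends App {
  val attempt0 = new StageAttempt(4, 0)

  // A fetch failure has been observed; the task for partition 9000 is still
  // running in 4.0 and finishes *before* attempt 4.1 is created:
  attempt0.markCompleted(9000)

  // The new attempt starts with empty bookkeeping:
  val attempt1 = new StageAttempt(4, 1)

  // 4.1 does not know 9000 already succeeded, so it launches a task that will
  // later fail with TaskCommitDenied and be retried:
  println(s"4.1 launches partition 9000? ${attempt1.shouldLaunch(9000)}") // true

  // One possible remedy along the lines this issue suggests: copy completions
  // from prior attempts into the new taskset when it is created.
  attempt1.completedPartitions ++= attempt0.completedPartitions
  println(s"After propagation: ${attempt1.shouldLaunch(9000)}") // false
}
{code}

In the real scheduler the analogous hand-off would happen when the TaskScheduler creates the taskset for the new stage attempt, seeding it with partitions the DAG scheduler already considers finished.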