pgandhi999 commented on issue #22806: [SPARK-25250][CORE] : On successful completion of a task attempt on a parti… URL: https://github.com/apache/spark/pull/22806#issuecomment-450911923

@cloud-fan I do not fully understand your question, but here is what happens under the current behaviour:

Say task 5005, working on partition 9000, is running in stage 4.0. Stage 4.0 fails, so stage 4.1 is created. Stage 4.1 sees that partition 9000 has not completed, so it schedules a new task, say task 2002, for that partition. After stage 4.1 starts running, however, task 5005 from the old attempt completes successfully and marks partition 9000 as complete. This information is never propagated to stage 4.1: the `successful` array in TaskSetManager.scala, which tracks which task indices have completed, is not updated for task sets belonging to stage 4.1.

Task 2002 therefore computes its result but cannot commit it to HDFS, because output for that partition has already been written, and it receives a CommitDeniedException. It treats this as a failure and keeps retrying, even though it should not. This continues until the Spark application as a whole succeeds, at which point the driver shuts down all the executors, killing task 2002.

I hope this answers your query.
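To make the race concrete, here is a minimal sketch of the bookkeeping involved. The class and method names below (`SimpleTaskSetManager`, `SimpleCommitCoordinator`, `markPartitionCompleted`) are hypothetical stand-ins, not Spark's real API; only the idea that each stage attempt keeps its own `successful` array, and that the commit coordinator lets only the first committer per partition win, mirrors the behaviour described above.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of per-attempt bookkeeping.
// Each stage attempt has its OWN `successful` array, which is
// the crux of the problem described in this comment.
class SimpleTaskSetManager(val stageAttempt: Int, numPartitions: Int) {
  // One flag per partition index, analogous to TaskSetManager.successful.
  val successful: Array[Boolean] = Array.fill(numPartitions)(false)

  def markPartitionCompleted(partition: Int): Unit =
    successful(partition) = true

  def pendingPartitions: Seq[Int] =
    successful.indices.filterNot(successful)
}

// Hypothetical commit coordinator: the first task to commit a
// partition wins; later attempts are denied (CommitDeniedException
// in real Spark).
class SimpleCommitCoordinator {
  private val committed = mutable.Set[Int]()
  def canCommit(partition: Int): Boolean = committed.add(partition)
}
```

Because the completion of task 5005 updates only attempt 4.0's array, attempt 4.1 still lists the partition as pending, schedules task 2002 for it, and that task's commit is then denied.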
