[
https://issues.apache.org/jira/browse/SPARK-26682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-26682:
------------------------------------
Assignee: Apache Spark
> Task attempt ID collision causes lost data
> ------------------------------------------
>
> Key: SPARK-26682
> URL: https://issues.apache.org/jira/browse/SPARK-26682
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.3, 2.3.2, 2.4.0
> Reporter: Ryan Blue
> Assignee: Apache Spark
> Priority: Major
>
> We recently tracked missing data to a collision in the fake Hadoop task
> attempt ID created when using Hadoop OutputCommitters. This is similar to
> SPARK-24589.
> A stage had one task fail to get one shard from a shuffle, causing a
> FetchFailedException and Spark resubmitted the stage. Because only one task
> was affected, the original stage attempt continued running tasks that had
> been resubmitted. Another task ran two attempts concurrently on the same
> executor, but had the same attempt number because they were from different
> stage attempts. Because the attempt number was the same, the task used the
> same temp locations. That caused one attempt to fail because a file path
> already existed, and that attempt then removed the shared temp location and
> deleted the other task's data. When the second attempt succeeded, it
> committed partial data.
> The problem was that both attempts had the same partition and attempt
> numbers, despite being run in different stages, and that was used to create a
> Hadoop task attempt ID on which the temp location was based. The fix is to
> use Spark's global task attempt ID, which is a counter, instead of attempt
> number because attempt number is reused in stage attempts.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]