[ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511695#comment-16511695 ]
Jiang Xingbo edited comment on SPARK-24552 at 6/13/18 9:47 PM:
---------------------------------------------------------------
IIUC, stageAttemptId + taskAttemptNumber should uniquely identify a task
attempt, and the pair carries enough information to know how many failed
attempts you had previously.
was (Author: jiangxb1987):
IIUC stageAttemptId + taskAttemptId shall probably define a unique task
attempt, and it carries enough information to know how many failed attempts you
had previously.
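To make the point above concrete, here is a small sketch (illustrative Python, not Spark code; the function name `attempt_ids` is invented for this example). It models how task attempt numbers restart at 0 on each stage retry, so the number alone collides across stage attempts, while the (stageAttemptId, taskAttemptNumber) pair stays unique:

```python
def attempt_ids(stage_attempts, tasks_per_stage):
    """Yield (stageAttemptId, taskAttemptNumber) pairs for one partition,
    modeling attempt numbers restarting at 0 on each stage retry."""
    for stage_attempt in range(stage_attempts):
        for attempt_number in range(tasks_per_stage):
            yield stage_attempt, attempt_number

ids = list(attempt_ids(stage_attempts=2, tasks_per_stage=3))
numbers_only = [n for _, n in ids]

assert len(set(numbers_only)) < len(ids)  # bare attempt numbers repeat
assert len(set(ids)) == len(ids)          # the pair is unique
```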
> Task attempt numbers are reused when stages are retried
> -------------------------------------------------------
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: Ryan Blue
> Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so
> that writers can use the attempt number to track and clean up data from
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the
> attempt number's javadoc states that "Implementations can use this attempt
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two
> tasks can create the same data and attempt to commit. The commit coordinator
> prevents both from committing and will abort the attempt that finishes last.
> When using the (partition, attempt) pair to track data, the aborted task may
> delete data associated with the (partition, attempt) pair. If that happens,
> the data for the task that committed is deleted as well, which is a
> correctness bug.
> For a concrete example, I have a data source that creates files in place
> named with {{part-<partition>-<attempt>-<uuid>.<format>}}. Because these
> files are written in place, both tasks create the same file and the one that
> is aborted deletes the file, leading to data corruption when the file is
> added to the table.
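The race in the example above can be sketched as follows (a hypothetical Python simulation, not the actual Spark v2 write path; the job-level uuid value and helper names are invented for illustration). Two stage attempts reuse the same (partition, attempt) pair, both target the same in-place file name, the commit coordinator lets only the first finisher commit, and the aborted task's cleanup deletes the committed data:

```python
# Job-level uuid shared by all tasks, so names collide when the
# (partition, attempt) pair repeats across stage attempts.
JOB_UUID = "9f1c"

def file_name(partition, attempt, fmt="parquet"):
    # Mirrors the part-<partition>-<attempt>-<uuid>.<format> scheme above.
    return f"part-{partition}-{attempt}-{JOB_UUID}.{fmt}"

storage = {}       # simulated file system: name -> contents
committed = set()  # (partition, attempt) keys the coordinator accepted

def run_task(partition, attempt, data):
    name = file_name(partition, attempt)
    storage[name] = data                   # write in place
    if (partition, attempt) not in committed:
        committed.add((partition, attempt))  # coordinator: first wins
        return "committed"
    del storage[name]                      # abort cleans up "its" file...
    return "aborted"

# Stage attempt 0: task attempt 0 for partition 3 finishes first, commits.
assert run_task(partition=3, attempt=0, data="rows-v1") == "committed"
# Stage attempt 1 reuses attempt number 0; its abort deletes the same file.
assert run_task(partition=3, attempt=0, data="rows-v2") == "aborted"
assert file_name(3, 0) not in storage      # the committed data is gone
```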
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)