Charles Allen created SPARK-19698:
-------------------------------------
Summary: Race condition in stale attempt task completion vs
current attempt task completion
Key: SPARK-19698
URL: https://issues.apache.org/jira/browse/SPARK-19698
Project: Spark
Issue Type: Bug
Components: Mesos, Spark Core
Affects Versions: 2.0.0
Reporter: Charles Allen
We have encountered a strange scenario in our production environment. Below is
the best guess we have right now as to what's going on.
Potentially, the final stage of a job has a failure in one of its tasks (such
as an OOME on the executor), which can cause tasks for that stage to be
relaunched in a second stage attempt.
https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
keeps track of which tasks have been completed, but does NOT keep track of
which attempt those tasks were completed in. As such, we have encountered a
scenario where a particular task gets executed twice in different stage
attempts, and the DAGScheduler does not consider whether the second attempt is
still running. This means that if the first attempt of the task succeeds, the
second attempt can be cancelled part-way through its run once all other tasks
(including the one that previously failed) have completed successfully.
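
To make the suspected sequence concrete, here is a toy model in Python. It is
not Spark's DAGScheduler code: the ToyStage class and its methods are
hypothetical, and the only thing it assumes from the description above is that
completion is recorded per partition without the stage attempt.

```python
# Toy model only, NOT Spark's DAGScheduler. All names here are hypothetical.
class ToyStage:
    def __init__(self, num_partitions):
        # Completion is recorded per partition; the stage attempt is not stored.
        self.finished = [False] * num_partitions
        # (stage_attempt, partition) pairs that are still executing.
        self.running = set()

    def launch(self, stage_attempt, partition):
        self.running.add((stage_attempt, partition))

    def on_success(self, stage_attempt, partition):
        self.running.discard((stage_attempt, partition))
        # The attempt number is dropped here, which is the suspected problem.
        self.finished[partition] = True

    def on_failure(self, stage_attempt, partition):
        self.running.discard((stage_attempt, partition))

    def job_done(self):
        return all(self.finished)


def simulate():
    stage = ToyStage(num_partitions=2)
    # Stage attempt 0: partition 0 is slow, partition 1 fails (e.g. an OOME).
    stage.launch(0, 0)
    stage.launch(0, 1)
    stage.on_failure(0, 1)
    # Stage attempt 1 relaunches both partitions.
    stage.launch(1, 0)
    stage.launch(1, 1)
    # The stale attempt-0 task for partition 0 finishes first,
    # then the relaunched partition 1 finishes.
    stage.on_success(0, 0)
    stage.on_success(1, 1)
    return stage.job_done(), sorted(stage.running)
```

In this model simulate() reports the job as done while (1, 0), the second
attempt of partition 0, is still listed as running. That is the window in which
a driver that cleans up after job completion would cancel a live task.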
What this means is that if a task is manipulating some state somewhere (for
example: an upload to a temporary file location, followed by a delete-then-move
on an underlying s3n storage implementation), the driver can improperly shut
down the running (2nd attempt) task between state manipulations. The persistent
state is then left inconsistent, because the 2nd attempt never got to complete
its manipulations and was terminated prematurely at some arbitrary point in its
state change logic (e.g. it finished the delete but not the move).
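
The effect on the stored output can be sketched the same way. This is a toy
model under stated assumptions: a Python dict stands in for the object store,
and commit, Cancelled and the cancel_after parameter are hypothetical names
used only to mimic a task being killed between steps. It does not reproduce the
actual s3n code path.

```python
# Toy model only: a dict stands in for the object store. Names are hypothetical.
class Cancelled(Exception):
    """Raised to mimic the driver killing the task between steps."""


def commit(store, key, data, cancel_after=None):
    """Upload to a temporary key, delete the old destination, then move.

    cancel_after names the step after which the task is killed.
    """
    tmp_key = key + ".tmp"
    steps = [
        ("upload", lambda: store.__setitem__(tmp_key, data)),
        ("delete", lambda: store.pop(key, None)),
        ("move", lambda: store.__setitem__(key, store.pop(tmp_key))),
    ]
    for name, step in steps:
        step()
        if name == cancel_after:
            raise Cancelled(name)


def run_two_attempts():
    store = {}
    # The first attempt commits its output completely.
    commit(store, "part-00000", b"attempt-0")
    # The second attempt is killed after the delete but before the move.
    try:
        commit(store, "part-00000", b"attempt-1", cancel_after="delete")
    except Cancelled:
        pass
    return store
```

After run_two_attempts() the destination key part-00000 is gone and only the
temporary key remains, even though the first attempt had already written a
complete result.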
This is using the Mesos coarse-grained executor. It is unclear whether this
behavior is limited to the Mesos coarse-grained executor.