Charles Allen created SPARK-19698:
-------------------------------------

             Summary: Race condition in stale attempt task completion vs current attempt task completion
                 Key: SPARK-19698
                 URL: https://issues.apache.org/jira/browse/SPARK-19698
             Project: Spark
          Issue Type: Bug
          Components: Mesos, Spark Core
    Affects Versions: 2.0.0
            Reporter: Charles Allen


We have encountered a strange scenario in our production environment. Below is 
the best guess we have right now as to what's going on.

We suspect that a task in the final stage of a job fails (for example, with an 
OOME on the executor), which causes tasks for that stage to be relaunched in a 
second stage attempt.

The code at

https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155

keeps track of which tasks have completed, but does NOT keep track of which 
attempt those tasks completed in. As a result, we have encountered a scenario 
where a particular task is executed twice, in different stage attempts, and the 
DAGScheduler does not consider whether the second attempt is still running. 
This means that if the first attempt of the task succeeds, the second attempt 
can be cancelled part-way through its run once all other tasks (including the 
one that previously failed) have completed successfully.
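
To make the suspected sequence concrete, here is a toy Python model of 
per-partition completion tracking. It is only a sketch of our guess: the class 
and function names are hypothetical, it is not Spark's DAGScheduler code, and it 
assumes that the second stage attempt relaunches a partition whose first attempt 
is still running.

```python
# Toy model only. PartitionOnlyTracker and replay_scenario are hypothetical
# names for illustration; this is NOT Spark's DAGScheduler implementation.

class PartitionOnlyTracker:
    """Marks a partition finished as soon as ANY attempt reports success."""

    def __init__(self, num_partitions):
        self.finished = [False] * num_partitions
        self.running = set()  # (partition, stage_attempt) pairs still executing

    def launch(self, partition, stage_attempt):
        self.running.add((partition, stage_attempt))

    def on_success(self, partition, stage_attempt):
        self.running.discard((partition, stage_attempt))
        # The attempt number is ignored here; only the partition is recorded.
        self.finished[partition] = True

    def on_failure(self, partition, stage_attempt):
        self.running.discard((partition, stage_attempt))

    def job_done(self):
        return all(self.finished)

    def tasks_killed_at_job_end(self):
        # Anything still running once the job is declared done gets cancelled.
        return sorted(self.running) if self.job_done() else []


def replay_scenario():
    tracker = PartitionOnlyTracker(num_partitions=2)
    # Stage attempt 0: both partitions are launched.
    tracker.launch(0, 0)
    tracker.launch(1, 0)
    tracker.on_failure(1, 0)  # partition 1 fails (e.g. OOME on the executor)
    # Stage attempt 1 (assumed): both partitions are relaunched while the
    # attempt-0 task for partition 0 is still running.
    tracker.launch(0, 1)
    tracker.launch(1, 1)
    tracker.on_success(0, 0)  # stale attempt of partition 0 completes
    tracker.on_success(1, 1)  # retried partition 1 completes
    return tracker.job_done(), tracker.tasks_killed_at_job_end()
```

In this model the job is considered done while the attempt-1 task for 
partition 0 is still running, so that task is the one that gets cancelled 
mid-run.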

The consequence is that if a task is manipulating external state (for example, 
an upload to a temporary file location followed by a delete-then-move on an 
underlying s3n storage implementation), the driver can improperly shut down the 
running (2nd attempt) task between state manipulations. The persistent state is 
then left inconsistent, because the 2nd attempt was terminated at some arbitrary 
point in its state change logic (e.g. it finished the delete but not the move).
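
The following Python sketch illustrates why an interruption between the delete 
and the move is harmful. It uses a plain dict as a stand-in for the object 
store; the key names, the `commit` helper and the `Killed` exception are 
hypothetical and do not correspond to any real s3n or Spark API.

```python
# Illustration only: a dict stands in for the object store, and Killed stands
# in for the driver cancelling the task. All names here are hypothetical.

class Killed(Exception):
    """Raised to simulate the task being terminated part-way through."""


def commit(store, tmp_key, final_key, data, kill_after=None):
    """Upload to a temporary key, delete the destination, then move."""
    store[tmp_key] = data                  # 1. upload to temporary location
    if kill_after == "upload":
        raise Killed()
    store.pop(final_key, None)             # 2. delete the existing destination
    if kill_after == "delete":
        raise Killed()
    store[final_key] = store.pop(tmp_key)  # 3. move temporary object into place


def demo():
    store = {}
    # First attempt runs to completion and publishes its output.
    commit(store, "tmp/attempt-0/part-0", "out/part-0", b"rows")
    # Second attempt is cancelled after the delete but before the move.
    try:
        commit(store, "tmp/attempt-1/part-0", "out/part-0", b"rows",
               kill_after="delete")
    except Killed:
        pass
    return store
```

After `demo()` the destination key `out/part-0` no longer exists, even though 
the first attempt had written it successfully; only the temporary object from 
the second attempt remains.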

We are using the Mesos coarse-grained executor. It is unclear whether this 
behavior is limited to the Mesos coarse-grained executor.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
