Ryan Williams created SPARK-9366:
------------------------------------
Summary: TaskEnd event emitted for task has different stage attempt ID than TaskStart for same task
Key: SPARK-9366
URL: https://issues.apache.org/jira/browse/SPARK-9366
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.4.1
Reporter: Ryan Williams
During a simple job I ran yesterday, I observed the following in the event log:
{code}
{"Event":"SparkListenerTaskStart","Stage ID":0,"Stage Attempt ID":1,
 "Task Info":{"Task ID":10244,"Index":55,"Attempt":1,"Launch Time":1437767843724,
  "Executor ID":"8","Host":"demeter-csmaz10-6.demeter.hpc.mssm.edu",
  "Locality":"PROCESS_LOCAL","Speculative":true,"Getting Result Time":0,
  "Finish Time":1437767844387,"Failed":false,"Accumulables":[]}}
…
{"Event":"SparkListenerTaskEnd","Stage ID":0,"Stage Attempt ID":2,
 "Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},
 "Task Info":{"Task ID":10244,"Index":55,"Attempt":1,"Launch Time":1437767843724,
  "Executor ID":"8","Host":"demeter-csmaz10-6.demeter.hpc.mssm.edu",
  "Locality":"PROCESS_LOCAL","Speculative":true,"Getting Result Time":0,
  "Finish Time":1437767844387,"Failed":false,"Accumulables":[]},
 "Task Metrics":{"Host Name":"demeter-csmaz10-6.demeter.hpc.mssm.edu",
  "Executor Deserialize Time":63,"Executor Run Time":579,"Result Size":2235,
  "JVM GC Time":0,"Result Serialization Time":1,
  "Memory Bytes Spilled":0,"Disk Bytes Spilled":0,
  "Shuffle Write Metrics":{"Shuffle Bytes Written":2736,
   "Shuffle Write Time":1388809,"Shuffle Records Written":100},
  "Input Metrics":{"Data Read Method":"Network",
   "Bytes Read":636000,"Records Read":100000}}}
{code}
The {{TaskStart}} event for task 10244 correctly listed it as coming from
stage 0, attempt 1, but the {{TaskEnd}} event shows it as part of stage 0,
attempt 2. I'm pretty sure this is due to [this
line|https://github.com/apache/spark/blob/1efe97dc9ed31e3b8727b81be633b7e96dd3cd34/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L930]
in the DAGScheduler, which fills in the latest attempt ID for the task's
stage rather than the attempt that the task actually belongs to.
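For anyone who wants to check their own event logs for this inconsistency, a minimal standalone scanner could look like the sketch below. It pairs each {{TaskEnd}} with the {{TaskStart}} for the same task ID and reports any whose stage attempt IDs disagree; the function name and the abbreviated sample events are illustrative, not part of Spark.

```python
import json

def find_attempt_mismatches(lines):
    """Scan Spark event-log JSON lines for tasks whose TaskEnd event
    carries a different (stage ID, stage attempt ID) than their TaskStart.
    Illustrative helper, not part of Spark itself."""
    start_attempts = {}  # Task ID -> (Stage ID, Stage Attempt ID) at TaskStart
    mismatches = []
    for line in lines:
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerTaskStart":
            task_id = event["Task Info"]["Task ID"]
            start_attempts[task_id] = (event["Stage ID"], event["Stage Attempt ID"])
        elif kind == "SparkListenerTaskEnd":
            task_id = event["Task Info"]["Task ID"]
            started = start_attempts.get(task_id)
            ended = (event["Stage ID"], event["Stage Attempt ID"])
            if started is not None and started != ended:
                mismatches.append((task_id, started, ended))
    return mismatches

# Abbreviated versions of the two events from the excerpt above:
log = [
    '{"Event":"SparkListenerTaskStart","Stage ID":0,"Stage Attempt ID":1,'
    '"Task Info":{"Task ID":10244}}',
    '{"Event":"SparkListenerTaskEnd","Stage ID":0,"Stage Attempt ID":2,'
    '"Task Info":{"Task ID":10244}}',
]
print(find_attempt_mismatches(log))  # [(10244, (0, 1), (0, 2))]
```

Running this over the full log from the affected job would surface every task whose start and end events disagree, not just task 10244.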
I know there's a lot of flux right now around concurrent stage attempts and
attempt-ID tracking, but this seems trivial to fix independently of that, so
I'll send a PR momentarily.