[ https://issues.apache.org/jira/browse/SPARK-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-10551.
----------------------------------
Resolution: Incomplete
> Successful task-end event after task failed due to executor loss
> ----------------------------------------------------------------
>
> Key: SPARK-10551
> URL: https://issues.apache.org/jira/browse/SPARK-10551
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.1
> Reporter: Ryan Williams
> Priority: Major
> Labels: bulk-closed
>
> While doing forensics on some failed Spark applications, I am seeing nonsensical
> things in the event logs, e.g.:
> {code}
> $ grep -n '"Task ID":12083' application_1439224376754_5702
> 24578:{"Event":"SparkListenerTaskStart","Stage ID":6,"Stage Attempt ID":0,"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
> 28918:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"232"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]}}
> 29062:{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":12083,"Index":145,"Attempt":0,"Launch Time":1440703704768,"Executor ID":"232","Host":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1440703707747,"Failed":true,"Accumulables":[]},"Task Metrics":{"Host Name":"demeter-csmaz11-11.demeter.hpc.mssm.edu","Executor Deserialize Time":181,"Executor Run Time":1585,"Result Size":8760,"JVM GC Time":0,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Write Metrics":{"Shuffle Bytes Written":454121,"Shuffle Write Time":43293396,"Shuffle Records Written":2549},"Input Metrics":{"Data Read Method":"Memory","Bytes Read":810520,"Records Read":2549}}}
> {code}
> Task ID 12083 has a TaskStart event, a TaskEnd event indicating that the task
> failed due to {{ExecutorLostFailure}}, and then a TaskEnd event saying that
> the task succeeded.
> The history server does not list this file in either the "complete" or the
> "incomplete" section, even though its stdout contains the following line (with no
> apparent exceptions afterwards), which I took to mean that it had parsed the file correctly:
> {code}
> 15/09/10 17:57:56 INFO FsHistoryProvider: Replaying log path: hdfs://demeter-nn1.demeter.hpc.mssm.edu/spark/tmp/logs/willir31/application_1439224376754_5702
> {code}
> [~arahuja] ran this application originally and says that the live web UI was
> showing inconsistent/nonsensical data.
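The anomaly described above (one task with an {{ExecutorLostFailure}} TaskEnd followed by a {{Success}} TaskEnd) can be searched for across a whole event log rather than one grep at a time. The following is a minimal sketch, not part of Spark: it assumes an uncompressed event log with one JSON event per line and relies only on the field names visible in the excerpt ("Event", "Task Info", "Task ID", "Task End Reason", "Reason"). The function name find_duplicate_task_ends and the SAMPLE data (trimmed copies of the three events above) are hypothetical illustrations.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_task_ends(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Return {task_id: [end reasons, in log order]} for every task that has
    more than one SparkListenerTaskEnd event.

    Hypothetical helper: assumes one JSON event per line (uncompressed log)
    and the field names seen in the SPARK-10551 excerpt.
    """
    reasons: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue  # only TaskEnd events matter here
        task_id = event["Task Info"]["Task ID"]
        reasons[task_id].append(event["Task End Reason"]["Reason"])
    # Keep only tasks that reported more than one end event.
    return {tid: r for tid, r in reasons.items() if len(r) > 1}


# Trimmed-down copies of the three events for Task ID 12083 shown above
# (hypothetical sample data; most fields removed).
SAMPLE = [
    '{"Event":"SparkListenerTaskStart","Task Info":{"Task ID":12083}}',
    '{"Event":"SparkListenerTaskEnd","Task End Reason":'
    '{"Reason":"ExecutorLostFailure","Executor ID":"232"},"Task Info":{"Task ID":12083}}',
    '{"Event":"SparkListenerTaskEnd","Task End Reason":'
    '{"Reason":"Success"},"Task Info":{"Task ID":12083}}',
]

if __name__ == "__main__":
    for task_id, rs in sorted(find_duplicate_task_ends(SAMPLE).items()):
        print(task_id, rs)  # 12083 ['ExecutorLostFailure', 'Success']
```

To scan a real log, pass an open file object instead of SAMPLE, e.g. `with open(path) as f: find_duplicate_task_ends(f)`; iterating a text file yields one line (one event) at a time, so the whole log never has to fit in memory.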
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)