[
https://issues.apache.org/jira/browse/SPARK-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633723#comment-14633723
]
Ryan Williams commented on SPARK-9038:
--------------------------------------
Note: this also causes Spark to think that these executors are still running
tasks and can't be dynamically un-allocated.
> Missing TaskEnd event when task attempt is superseded by another
> (speculative) attempt
> --------------------------------------------------------------------------------------
>
> Key: SPARK-9038
> URL: https://issues.apache.org/jira/browse/SPARK-9038
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.1
> Reporter: Ryan Williams
>
> Yesterday I ran a job that produced [this event
> log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0].
> There are 17314 {{TaskStart}}'s and 17313 {{TaskEnd}}'s; task ID 15820 (aka
> 13.0.526.0) is missing a {{TaskEnd}} event.
> A speculative second attempt, ID 16295 (13.0.526.1) finished before it; 15820
> was the last taskattempt running in stage-attempt 13.0 and job 3, and when it
> finished the latter two were each marked as succeeded.
> At the conclusion of stage 13 / job 3, I observed a few things to be in
> conflicting/inconsistent states:
> *Reflecting 15820 as having finished successfully:*
> * The "stage page" for 13.0 [showed "SUCCESS" in the "Status" column of the
> per-task-attempt
> table|http://cl.ly/image/2O0O42382p2W?_ga=1.265890767.118106744.1401937910].
> * The driver stdout reported 15820's successful finish, and that it was being
> ignored due to another attempt of the same task (16295, per above) having
> already succeeded:
> {code}
> 15/07/13 23:30:40 INFO scheduler.TaskSetManager: Ignoring task-finished event
> for 526.0 in stage 13.0 because task 526 has already completed successfully
> 15/07/13 23:30:40 INFO cluster.YarnScheduler: Removed TaskSet 13.0, whose
> tasks have all completed, from pool
> 15/07/13 23:30:40 INFO scheduler.DAGScheduler: Job 3 finished: collect at
> JointHistogram.scala:107, took 579.659523 s
> {code}
> *Not reflecting 15820 as having finished at all:*
> * As I mentioned before, [the event
> log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0]
> is missing a {{TaskEnd}} for 15820.
> * The {{AllJobsPage}} shows 11258 tasks finished in job 3; it would have been
> 11259 with 15820.
> ** Additionally, inspecting the page in the DOM revealed a 1-task-wide sliver
> of light-blue (i.e. running task(s)) in the progress bar.
> ** [This
> screenshot|http://cl.ly/image/3O201z0e0G2C?_ga=1.265890767.118106744.1401937910]
> shows both of these on the {{AllJobsPage}}.
> * A history server, pointed at the event log, consistently shows 15820 as not
> having finished.
> ** This is somewhat unsurprising given that the event log powering the
> history server doesn't {{TaskEnd}} 15820, but seems notable nonetheless since
> the live UI seemingly *did* partially record the task as having ended (cf.
> stage page showing SUCCESS).
> ** Stage page shows 15820 as RUNNING.
> ** AllJobsPage shows 11258 tasks succeeded, 1 running.
> I've gone over the relevant task-success code paths and can't understand how
> the stage page would show me SUCCESS in the live UI, without anything having
> been written to the event log or the AllJobsPage's counters having been
> updated. [Here is a bunch of my driver
> stdout|https://www.dropbox.com/s/pr7rswt4o2umm20/3010.stdout?dl=0], that
> shows nothing abnormal afaict; and [the dreaded message about events being
> dropped|https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40]
> did not appear anywhere while the app was running, which was one of my only
> guesses about how this could have happened (but which wouldn't fully explain
> all of the above anyway).
> Interested in hearing anyones' thoughts about how I might have arrived at
> this inconsistent state.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]