Ryan Williams created SPARK-9038:
------------------------------------
Summary: Missing TaskEnd event when task attempt is superseded by
another (speculative) attempt
Key: SPARK-9038
URL: https://issues.apache.org/jira/browse/SPARK-9038
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.4.1
Reporter: Ryan Williams
Yesterday I ran a job that produced [this event
log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0].
There are 17314 {{TaskStart}}'s and 17313 {{TaskEnd}}'s; task ID 15820 (aka
13.0.526.0) is missing a {{TaskEnd}} event.
A speculative second attempt, ID 16295 (13.0.526.1) finished before it; 15820
was the last taskattempt running in stage-attempt 13.0 and job 3, and when it
finished the latter two were each marked as succeeded.
At the conclusion of stage 13 / job 3, I observed a few things to be in
conflicting/inconsistent states:
*Reflecting 15820 as having finished successfully:*
* The "stage page" for 13.0 [showed "SUCCESS" in the "Status" column of the
per-task-attempt
table|http://cl.ly/image/2O0O42382p2W?_ga=1.265890767.118106744.1401937910].
* The driver stdout reported 15820's successful finish, and that it was being
ignored due to another attempt of the same task (16295, per above) having
already succeeded:
{code}
15/07/13 23:30:40 INFO scheduler.TaskSetManager: Ignoring task-finished event
for 526.0 in stage 13.0 because task 526 has already completed successfully
15/07/13 23:30:40 INFO cluster.YarnScheduler: Removed TaskSet 13.0, whose tasks
have all completed, from pool
15/07/13 23:30:40 INFO scheduler.DAGScheduler: Job 3 finished: collect at
JointHistogram.scala:107, took 579.659523 s
{code}
*Not reflecting 15820 as having finished at all:*
* As I mentioned before, [the event
log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0]
is missing a {{TaskEnd}} for 15820.
* The {{AllJobsPage}} shows 11258 tasks finished in job 3; it would have been
11259 with 15820.
** Additionally, inspecting the page in the DOM revealed a 1-task-wide sliver
of light-blue (i.e. running task(s)) in the progress bar.
** [This
screenshot|http://cl.ly/image/3O201z0e0G2C?_ga=1.265890767.118106744.1401937910]
shows both of these on the {{AllJobsPage}}.
* A history server, pointed at the event log, consistently shows 15820 as not
having finished.
** This is somewhat unsurprising given that the event log powering the history
server doesn't {{TaskEnd}} 15820, but seems notable nonetheless since the live
UI seemingly *did* partially record the task as having ended (cf. stage page
showing SUCCESS).
** Stage page shows 15820 as RUNNING.
** AllJobsPage shows 11258 tasks succeeded, 1 running.
I've gone over the relevant task-success code paths and can't understand how
the stage page would show me SUCCESS in the live UI, without anything having
been written to the event log or the AllJobsPage's counters having been
updated. [Here is a bunch of my driver
stdout|https://www.dropbox.com/s/pr7rswt4o2umm20/3010.stdout?dl=0], that shows
nothing abnormal afaict; and [the dreaded message about events being
dropped|https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40]
did not appear anywhere while the app was running, which was one of my only
guesses about how this could have happened (but which wouldn't fully explain
all of the above anyway).
Interested in hearing anyones' thoughts about how I might have arrived at this
inconsistent state.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]