Dr Stephen A Hellberg created SPARK-10976:
---------------------------------------------
Summary: java.lang.UnsupportedOperationException: taskSucceeded()
called on a finished JobWaiter
Key: SPARK-10976
URL: https://issues.apache.org/jira/browse/SPARK-10976
Project: Spark
Issue Type: Bug
Components: Scheduler, Spark Core
Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0
Environment: Has arisen on a variety of OSes and platforms.
It is highly intermittent, but annoying - we've seen it through the 1.4.x
and 1.5.x releases.
My environment of current interest happens to be zLinux, which potentially
exhibits a higher degree of concurrency than many others; I'm using an IBM
Java 1.8.0, but this problem has also been experienced in other environments,
with other vendors' Java, e.g. see External URL
Reporter: Dr Stephen A Hellberg
Priority: Minor
This issue surfaces in the "misbehaved resultHandler should not crash
DAGScheduler and SparkContext" test, part of the DAGSchedulerSuite. I've been
trying to determine the cause of this problem when it arises (infrequent as it
is) by surfacing some of the state transitions in the JobWaiter code
responsible for throwing the j.l.UnsupportedOperationException.
Of relevance, the UnsupportedOperationException is thrown on the first call to
taskSucceeded() after object instantiation: the executing thread throws the
exception because it finds _jobFinished to be 'true' - yes, before any of the
tasks being waited upon have reported their success/failure. That is,
_jobFinished (a volatile variable) is being perceived as set true during
object initialisation... as if its value is/was based on the boolean
expression 'totalTasks == 0' (totalTasks is one of the formal arguments of the
class constructor). In fact, the correct initial state for these variables in
the relevant DAGSchedulerSuite test is totalTasks == 2, and hence _jobFinished
should be false. We are apparently seeing a race condition between the read
and write operations of different threads; only the volatile annotation on
_jobFinished is providing any thread safety?
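The hypothesised failure can be sketched with a minimal JobWaiter-like class
(a simplified reconstruction for illustration only, not the actual Spark
source; 'MiniJobWaiter' is a made-up name):

```scala
// Minimal sketch of the JobWaiter state relevant to this report.
// The guard at the top of taskSucceeded() is what throws the
// UnsupportedOperationException when _jobFinished is observed as true.
class MiniJobWaiter(totalTasks: Int) {
  private var finishedTasks = 0

  // Initialised from 'totalTasks == 0'; the hypothesis above is that a
  // racing reader perceives this flag as true before construction is
  // fully visible, even though totalTasks == 2 should make it false.
  @volatile private var _jobFinished: Boolean = totalTasks == 0

  def jobFinished: Boolean = _jobFinished

  def taskSucceeded(index: Int): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException(
        "taskSucceeded() called on a finished JobWaiter")
    }
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      _jobFinished = true
    }
  }
}
```

In the failing runs, the very first taskSucceeded() call takes the throw
branch, even though neither of the two expected tasks has yet completed.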
The DAGSchedulerSuite test then fails because the ScalaTest assertion expects
to receive a deliberately thrown exception, DAGSchedulerSuiteDummyException,
from the ResultHandler function, albeit as a check on the setup of the test.
Instead, in our problem scenario it _first_ captures the RuntimeException - the
UnsupportedOperationException - produced by the (incompletely initialised?)
JobWaiter code.
The test's name suggests that the objective is that the DAGScheduler and
SparkContext are 'not crashed'... it proceeds to run a count operation on the
SparkContext, and both succeed... that is, neither is apparently crashed...
which should be a positive outcome?
It would be... except for this occasional RuntimeException clouding the issue.
(Is this deliberate... or is it a deficiency of the current test case?)
- misbehaved resultHandler should not crash DAGScheduler and SparkContext ***
FAILED ***
java.lang.UnsupportedOperationException: taskSucceeded() called on a finished
JobWaiter was not instance of
org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
(DAGSchedulerSuite.scala:869)
Failed: failing job... exception:
org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
Succeeded: 0 (0 of 2)
Succeeded: 1 (1 of 2)
(My additional diagnostics presented here are minimal... I've surfaced the
exception passed into the jobFailed() routine, plus the index, finishedTasks,
and totalTasks (as ".. of ..") in the "Succeeded" message from
taskSucceeded().)
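For context, my reading of the test's assertion (paraphrased, not the verbatim
suite code; the class names below are hypothetical stand-ins for
SparkDriverExecutionException and DAGSchedulerSuiteDummyException) is that the
cause wrapped by the driver-side exception must be the dummy exception, so the
race's UnsupportedOperationException makes the instance-of check fail:

```scala
// Hypothetical stand-ins; only the cause-checking shape matters here.
class DummyException extends Exception
class DriverExecutionException(cause: Throwable) extends Exception(cause)

// The check the suite effectively performs on the wrapped cause.
def causeIsDummy(e: DriverExecutionException): Boolean =
  e.getCause.isInstanceOf[DummyException]
```

The check holds for the intended DummyException, but fails when the
JobWaiter's UnsupportedOperationException arrives as the cause instead -
producing the "was not instance of" failure shown above.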
I thought I was close - I still might be - to proposing a fix for this issue,
although its intermittency is hampering my efforts. Nevertheless, I wanted to
submit my hypothesis for any feedback.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)