[
https://issues.apache.org/jira/browse/SPARK-10976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953260#comment-14953260
]
Dr Stephen A Hellberg commented on SPARK-10976:
-----------------------------------------------
Got further this time... although I was defeated by my inexperience reviewing
ScalaTest output!
My diagnostics were working and producing output across multiple
DAGSchedulerSuite tests, but the pass/fail verdict is printed at the end of the
relevant test; hence the output above was the failing test's summary output,
interleaved with the diagnostics exercised by the next (passing) test:
"getPartitions exceptions". Not very helpful.
Anyway, I could make much more sense of the behaviour once I started looking at
the correct set of diagnostics... and at the relative order of the exceptions:
first the DAGSchedulerSuiteDummyException, then the
UnsupportedOperationException(s), which is exactly what we should expect from
'succeeding tasks' of a failed job. And this occurs reliably... there is no
volatile state to be concerned with (nor any strange initialisation states of
Scala objects); there is a race condition, but it concerns the propagation of
the re-thrown causing exception once it is wrapped in a
SparkDriverExecutionException. And it's also easily addressed if we can ensure
our job is simple enough not to consist of multiple tasks... See SPARK-11066
for my continuation.
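To illustrate the ordering described above, here is a minimal sketch of a
JobWaiter-like state machine. This is not Spark's actual
org.apache.spark.scheduler.JobWaiter; the class and method bodies below are my
own simplification, written only to show why later taskSucceeded() calls throw
once the job has already failed:

```scala
// Sketch only: a stripped-down waiter (not Spark's real JobWaiter).
// Once jobFailed() marks the job finished, any subsequent taskSucceeded()
// call throws UnsupportedOperationException -- matching the observed order:
// the deliberate failure exception first, then UnsupportedOperationException(s)
// from tasks of the failed job that still report success.
class SketchJobWaiter(totalTasks: Int) {
  @volatile private var _jobFinished = totalTasks == 0
  private var finishedTasks = 0

  def taskSucceeded(index: Int): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException(
        "taskSucceeded() called on a finished JobWaiter")
    }
    finishedTasks += 1
    if (finishedTasks == totalTasks) _jobFinished = true
  }

  def jobFailed(exception: Exception): Unit = synchronized {
    _jobFinished = true   // job is over; remaining tasks arrive too late
  }
}

object SketchJobWaiterDemo {
  def main(args: Array[String]): Unit = {
    val waiter = new SketchJobWaiter(totalTasks = 2)
    waiter.jobFailed(new RuntimeException("deliberate failure"))
    try waiter.taskSucceeded(0)   // a task of the failed job reports success
    catch {
      case e: UnsupportedOperationException =>
        println("caught: " + e.getMessage)
    }
  }
}
```

With a single-task job the failure and the last task report coincide, which is
why keeping the job to one task sidesteps the ordering problem.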
> java.lang.UnsupportedOperationException: taskSucceeded() called on a finished
> JobWaiter
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-10976
> URL: https://issues.apache.org/jira/browse/SPARK-10976
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Has arisen in a variety of OSes, and platforms.
> It is highly intermittent, but annoying; we've seen it through the
> 1.4.x and 1.5.x releases.
> My environment of current interest happens to be zLinux, which potentially
> represents a higher degree of concurrency than many others; I'm using an IBM
> Java 1.8.0, but this problem has been experienced on other environments, with
> other vendors' Java, e.g. see External URL
> Reporter: Dr Stephen A Hellberg
> Priority: Minor
>
> This issue is surfaced by the "misbehaved resultHandler should not crash
> DAGScheduler and SparkContext" test, part of the DAGSchedulerSuite. I've
> been trying to determine the cause of this problem when it arises (as
> infrequently as it does) by surfacing some of the state transitions in the
> JobWaiter code responsible for throwing the
> j.l.UnsupportedOperationException.
> Of relevance, the UnsupportedOperationException is thrown on the first
> occasion taskSucceeded() is called (after object instantiation): the
> executing thread throws the exception because it finds _jobFinished
> to be 'true' - yes, before any of the tasks being waited upon have reported
> their success/failure. That is, _jobFinished (a volatile variable) is being
> perceived to be set true during object initialisation... as if its value
> is/was based on the boolean expression 'totalTasks == 0' (totalTasks is one
> of the formal arguments of the class constructor). In fact, the correct
> initial state for the relevant DAGSchedulerSuite test is totalTasks == 2,
> and hence _jobFinished should be false. We are apparently seeing a race
> condition amongst the read and write operations performed by different
> threads; only the volatile annotation on _jobFinished is providing any
> thread safety?
> The DAGSchedulerSuite test then fails because the ScalaTest assertion
> expects to receive a deliberately thrown exception,
> DAGSchedulerSuiteDummyException, from the ResultHandler function (albeit as
> a check on the setup of the test?). Instead, in our problem scenario, it
> _first_ captures the RuntimeException - the UnsupportedOperationException -
> produced by the (incompletely initialised?) JobWaiter code.
> The test's stated objective is that the DAGScheduler and SparkContext are
> 'not crashed'... it proceeds to conduct a count operation on the
> SparkContext, and both succeed... that is, neither is apparently
> crashed... which should be a positive outcome?
> It would be... except for this occasional RuntimeException clouding the
> issue. (Is this deliberate... or is this a deficiency of the current
> testcase?)
> - misbehaved resultHandler should not crash DAGScheduler and SparkContext ***
> FAILED ***
> java.lang.UnsupportedOperationException: taskSucceeded() called on a
> finished JobWaiter was not instance of
> org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
> (DAGSchedulerSuite.scala:869)
> Failed: failing job... exception:
> org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
> Succeeded: 0 (0 of 2)
> Succeeded: 1 (1 of 2)
> (My additional diagnostics presented here are minimal... I've surfaced the
> exception passed to the jobFailed() routine, plus the index, finishedTasks
> and totalTasks (as ".. of ..") in the "Succeeded" message from
> taskSucceeded().)
> I thought I was close - I still might be - to proposing a fix for this issue,
> although the intermittency of this issue is hampering my efforts.
> Nevertheless, I wanted to submit my hypothesis for any feedback.
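The initialisation hypothesis quoted above can be made concrete with a small
sketch. Again, this is not Spark's source; the class name is mine, and it only
demonstrates the one legitimate way a waiter's finished flag can start out
true - the zero-task edge case - versus the two-task case the test actually
constructs:

```scala
// Sketch (not Spark's real JobWaiter): the finished flag's initial value.
// With totalTasks == 0 there is nothing to wait for, so the flag is born
// true; the DAGSchedulerSuite test constructs the waiter with totalTasks == 2,
// so the flag should start false - any other observation suggests a race.
class SketchWaiterInit(totalTasks: Int) {
  @volatile var jobFinished: Boolean = totalTasks == 0
}

object InitCheck {
  def main(args: Array[String]): Unit = {
    assert(new SketchWaiterInit(0).jobFinished)    // zero-task job: born finished
    assert(!new SketchWaiterInit(2).jobFinished)   // two tasks: not finished yet
    println("initial-state checks passed")
  }
}
```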
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)