[
https://issues.apache.org/jira/browse/SPARK-10976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dr Stephen A Hellberg closed SPARK-10976.
-----------------------------------------
Resolution: Invalid
Cancelling my first report and investigation of this problem, which was based
on a spurious hypothesis. My continuation is much better reasoned.
> java.lang.UnsupportedOperationException: taskSucceeded() called on a finished
> JobWaiter
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-10976
> URL: https://issues.apache.org/jira/browse/SPARK-10976
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Has arisen on a variety of OSes and platforms.
> It is highly intermittent, but annoying - we've seen it throughout the
> 1.4.x and 1.5.x releases.
> My environment of current interest happens to be zLinux, which potentially
> represents a higher degree of concurrency than many others; I'm using an IBM
> Java 1.8.0, but this problem has been experienced in other environments, with
> other vendors' Java, e.g. see External URL
> Reporter: Dr Stephen A Hellberg
> Priority: Minor
>
> This issue surfaced from the "misbehaved resultHandler should not crash
> DAGScheduler and SparkContext" test, part of the DAGSchedulerSuite. I've
> been trying to determine the cause of this problem when it arises (as
> infrequently as it does), and surfacing some of the state transitions in
> the JobWaiter code responsible for throwing the
> j.l.UnsupportedOperationException.
> Of relevance, the UnsupportedOperationException is being thrown on the first
> occasion that taskSucceeded() is called (after object instantiation), and
> the executing thread throws the exception because it finds _jobFinished
> to be 'true' - yes, before any of the tasks being waited upon have reported
> their success/failure. That is, _jobFinished (a volatile variable) is being
> perceived as set to true during object initialisation... as if its value
> is/was based on the boolean expression 'totalTasks == 0' (totalTasks is one
> of the formal arguments of the class constructor). In fact, the correct
> values for the initial state of these variables during the relevant
> DAGSchedulerSuite test are totalTasks == 2, and hence
> _jobFinished == false. We are apparently seeing a race condition between
> the threads' read and write operations; only the volatile annotation on
> _jobFinished is providing any thread safety?
> The DAGSchedulerSuite test then fails because the ScalaTest assertion
> expects a deliberately thrown exception - DAGSchedulerSuiteDummyException,
> from the ResultHandler function - albeit as a check on the setup of the
> test. Instead, in our problem scenario, it _first_ captures the
> RuntimeException - the UnsupportedOperationException - produced by the
> (incompletely initialised?) JobWaiter code.
> The test suggests that the objective is that the DAGScheduler and
> SparkContext are 'not crashed'... it proceeds to conduct a count operation
> on the SparkContext, and both succeed... that is, neither is apparently
> crashed... which should be a positive outcome?
> It would be... except for this occasional RuntimeException clouding the
> issue. (Is this deliberate... or is it a deficiency of the current
> testcase?)
> - misbehaved resultHandler should not crash DAGScheduler and SparkContext ***
> FAILED ***
> java.lang.UnsupportedOperationException: taskSucceeded() called on a
> finished JobWaiter was not instance of
> org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
> (DAGSchedulerSuite.scala:869)
> Failed: failing job... exception:
> org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
> Succeeded: 0 (0 of 2)
> Succeeded: 1 (1 of 2)
> (My additional diagnostics presented here are minimal... I've surfaced the
> exception passed into the jobFailed() routine, plus the index,
> finishedTasks, and totalTasks (the ".. of .." counts) in the "Succeeded"
> message from taskSucceeded().)
> I thought I was close - I still might be - to proposing a fix for this
> issue, although its intermittency is hampering my efforts.
> Nevertheless, I wanted to submit my hypothesis for any feedback.
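[Editorial note: the state machine described in the quoted report can be
sketched as below. This is a simplified illustration of the suspected
initialisation race, not Spark's actual JobWaiter source; the class name
SimpleJobWaiter and its members are hypothetical stand-ins for the fields
the report names (_jobFinished, finishedTasks, totalTasks).]

```scala
// Sketch of a JobWaiter-like class. The key point from the report: the
// volatile flag _jobFinished is initialised from the expression
// totalTasks == 0 (a zero-task job is trivially finished). If a thread
// ever observed this flag as true for a 2-task job before any task
// completed, taskSucceeded() would throw exactly the
// UnsupportedOperationException quoted above.
class SimpleJobWaiter(totalTasks: Int) {
  private var finishedTasks = 0

  // Volatile so that the waiting thread sees the scheduler thread's write;
  // per the report, this annotation is the only thread-safety guarantee
  // on reads of this flag.
  @volatile private var _jobFinished: Boolean = totalTasks == 0

  def jobFinished: Boolean = _jobFinished

  def taskSucceeded(index: Int): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException(
        "taskSucceeded() called on a finished JobWaiter")
    }
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      _jobFinished = true
    }
  }
}
```

In the healthy case the report expects, a waiter constructed with
totalTasks == 2 starts with _jobFinished == false and only flips to true
after both tasks report success; the bug report describes the guard firing
on the very first call instead.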
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]