[ https://issues.apache.org/jira/browse/SPARK-10976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dr Stephen A Hellberg closed SPARK-10976.
-----------------------------------------
    Resolution: Invalid

Cancelling my first report and investigation of this problem, which was based 
on a spurious hypothesis.  My continuation is much better reasoned.

> java.lang.UnsupportedOperationException: taskSucceeded() called on a finished 
> JobWaiter
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-10976
>                 URL: https://issues.apache.org/jira/browse/SPARK-10976
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>         Environment: Has arisen on a variety of OSes and platforms.
> It is highly intermittent, but annoying - we've seen it through the 1.4.x 
> and 1.5.x releases.
> My environment of current interest happens to be zLinux, which potentially 
> exhibits a higher degree of concurrency than many others; I'm using an IBM 
> Java 1.8.0, but this problem has been experienced in other environments, 
> with other vendors' Java, e.g. see External URL
>            Reporter: Dr Stephen A Hellberg
>            Priority: Minor
>
> This issue surfaced from the "misbehaved resultHandler should not crash 
> DAGScheduler and SparkContext" test, part of the DAGSchedulerSuite.  I've 
> been trying to determine the cause of this problem when it arises (as 
> infrequently as it does), by surfacing some of the state transitions in 
> the JobWaiter code responsible for throwing the 
> j.l.UnsupportedOperationException.
> Of relevance, the UnsupportedOperationException is thrown on the first 
> occasion taskSucceeded() is called (after object instantiation): the 
> executing thread throws the exception because it finds _jobFinished to be 
> 'true' - yes, before any of the tasks being waited upon have reported 
> their success/failure.  That is, _jobFinished (a volatile variable) is 
> perceived to be set true during object initialisation... as if its value 
> is/was based on the boolean expression 'totalTasks == 0' (totalTasks is 
> one of the formal arguments of the class constructor).  In fact, the 
> correct initial state for the relevant DAGSchedulerSuite test is 
> totalTasks == 2, and hence _jobFinished should be false.  We are 
> apparently seeing a race condition between the threads' reads and writes; 
> only the volatile annotation on _jobFinished is providing any thread 
> safety?
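> For context, here is a paraphrase of the JobWaiter state in question 
> (abridged from my reading of the 1.5-era source; not a verbatim copy):
>
>     // Paraphrase of org.apache.spark.scheduler.JobWaiter (1.5-era),
>     // abridged to the members discussed above.
>     private[spark] class JobWaiter[T](
>         dagScheduler: DAGScheduler,
>         val jobId: Int,
>         totalTasks: Int,
>         resultHandler: (Int, T) => Unit)
>       extends JobListener {
>
>       private var finishedTasks = 0
>
>       // True from construction only when the job has no tasks at all
>       // (e.g. zero-partition RDDs); otherwise flipped on completion/failure.
>       @volatile private var _jobFinished = totalTasks == 0
>
>       override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
>         if (_jobFinished) {
>           throw new UnsupportedOperationException(
>             "taskSucceeded() called on a finished JobWaiter")
>         }
>         resultHandler(index, result.asInstanceOf[T])
>         finishedTasks += 1
>         if (finishedTasks == totalTasks) {
>           _jobFinished = true
>           this.notifyAll()
>         }
>       }
>     }
>
> The hypothesis above amounts to a reader of _jobFinished observing the 
> 'totalTasks == 0' initialiser result instead of the intended initial value.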
> The DAGSchedulerSuite test then fails because its ScalaTest assertion 
> expects to receive a deliberately thrown exception - a 
> DAGSchedulerSuiteDummyException from the resultHandler function - albeit 
> as a check on the setup of the test?  Instead, in our problem scenario, 
> it _first_ captures the RuntimeException - the 
> UnsupportedOperationException - produced from the (incompletely 
> initialised?) JobWaiter code.
> The test suggests that the objective is that the DAGScheduler and 
> SparkContext are 'not crashed'... it proceeds to conduct a count 
> operation on the SparkContext, which succeeds... that is, neither is 
> apparently crashed... which should be a positive outcome?
> It would be... except for this occasional RuntimeException clouding the 
> issue.  (Is this deliberate... or is it a deficiency of the current 
> testcase?)
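> For reference, the approximate shape of the test in question (a sketch of 
> the 1.5-era DAGSchedulerSuite test; abridged, and details such as the 
> exact partitioning are inferred from the log below):
>
>     test("misbehaved resultHandler should not crash DAGScheduler and SparkContext") {
>       val e = intercept[SparkDriverExecutionException] {
>         sc.runJob[Int, Int](
>           sc.parallelize(1 to 10, 2),
>           (context: TaskContext, iter: Iterator[Int]) => iter.size,
>           Seq(0, 1),  // two partitions, hence totalTasks == 2 in the log below
>           // The 'misbehaved' resultHandler: deliberately throws on the driver.
>           (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
>       }
>       assert(e.getCause.isInstanceOf[DAGSchedulerSuiteDummyException])
>       // Neither the DAGScheduler nor the SparkContext should have crashed.
>       assert(sc.parallelize(1 to 10, 2).count() === 10)
>     }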
> - misbehaved resultHandler should not crash DAGScheduler and SparkContext *** 
> FAILED ***
>   java.lang.UnsupportedOperationException: taskSucceeded() called on a 
> finished JobWaiter was not instance of 
> org.apache.spark.scheduler.DAGSchedulerSuiteDummyException 
> (DAGSchedulerSuite.scala:869)
> Failed: failing job... exception: 
> org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
> Succeeded: 0 (0 of 2)
> Succeeded: 1 (1 of 2)
> (My additional diagnostics presented here are minimal: I've surfaced the 
> exception passed to the jobFailed() routine, and - as the "Succeeded: 
> <index> (<finishedTasks> of <totalTasks>)" message - the index, 
> finishedTasks, and totalTasks from taskSucceeded().)
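> A minimal sketch of that instrumentation, for clarity (hypothetical - the 
> actual diagnostic patch is not attached to this report):
>
>     // In JobWaiter.jobFailed(): surface the exception being recorded.
>     println(s"Failed: failing job... exception: $exception")
>
>     // In JobWaiter.taskSucceeded(), after finishedTasks += 1:
>     println(s"Succeeded: $index ($finishedTasks of $totalTasks)")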
> I thought I was close - I still might be - to proposing a fix for this issue, 
> although the intermittency of this issue is hampering my efforts.  
> Nevertheless, I wanted to submit my hypothesis for any feedback.


