[ https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954736#comment-14954736 ]
Dr Stephen A Hellberg commented on SPARK-11066:
-----------------------------------------------

Sean: apologies re: Fix Version and Target Version; I was led astray in interpreting their purpose, given that both were present on the Create Issue template. Fix Version makes complete sense: until the fix is integrated, it's not fixed. Target Version I had interpreted as where I'd hope to see the fix released, i.e. where it is suitable for being applied. I know this issue arises in the 1.4.x releases (and probably earlier), but I'm mostly interested in seeing it addressed in current and future releases; my fix is likely equally sufficient in prior releases, so what criteria are used to decide how far back a committer would backport a fix (given only Affects Versions)?

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler
> occasionally fails due to j.l.UnsupportedOperationException concerning a
> finished JobWaiter
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11066
>                 URL: https://issues.apache.org/jira/browse/SPARK-11066
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core, Tests
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>        Environment: Multiple OS and platform types.
> (Also observed by others, e.g.
> see External URL)
>            Reporter: Dr Stephen A Hellberg
>            Priority: Minor
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an
> inherent problem: it creates a job for the DAGScheduler comprising multiple
> (2) tasks. Whilst the job will fail and a SparkDriverExecutionException will
> be returned, a race condition exists as to which exception becomes its
> cause: either the first task's deliberately thrown
> DAGSchedulerSuiteDummyException (thrown as the setup of the misbehaving
> test), or the DAGScheduler's legitimate UnsupportedOperationException (a
> subclass of RuntimeException), raised when the second (and any subsequent)
> task completes against the already-failed job. This race condition is likely
> associated with the vagaries of processing quanta, and the expense of
> throwing two exceptions (under interpreter execution) per thread of control;
> the race is usually 'won' by the first task throwing the
> DAGSchedulerSuiteDummyException, as desired (and expected)... but not
> always.
>
> The problem for the testcase is that the first assertion largely concerns
> the test setup, and doesn't (can't? Sorry, still not a ScalaTest expert)
> capture all the causes of SparkDriverExecutionException that can
> legitimately arise from a correctly working (not crashed) DAGScheduler.
> Arguably, this assertion might test something of the DAGScheduler... but not
> all the possible outcomes of a working DAGScheduler. Nevertheless, this
> test, when comprising a multiple-task job, will report a failure when in
> fact the DAGScheduler is working as designed (and not crashed ;-).
> Furthermore, the test has already failed before it tries to use the
> SparkContext a second time (for an arbitrary processing task), which I think
> is the real subject of the test?
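The first-failure-wins behaviour behind the race can be sketched with a toy model (plain Java, hypothetical names; this is not Spark's actual JobWaiter): whichever task's exception reaches the waiter first becomes the job's cause, and any later task completion against the already-finished job surfaces as an UnsupportedOperationException instead.

```java
import java.util.concurrent.atomic.AtomicReference;

public class RaceSketch {
    // Hypothetical stand-in for Spark's JobWaiter: records only the FIRST
    // failure; any subsequent completion hits the "already finished" path.
    static class MiniJobWaiter {
        private final AtomicReference<Throwable> cause = new AtomicReference<>();

        void taskFinished(Throwable error) {
            if (!cause.compareAndSet(null, error)) {
                // The job is already failed: reject the late task, as Spark's
                // JobWaiter does for a finished job.
                throw new UnsupportedOperationException(
                        "taskSucceeded() called on a finished JobWaiter");
            }
        }

        Throwable jobCause() { return cause.get(); }
    }

    public static void main(String[] args) {
        MiniJobWaiter waiter = new MiniJobWaiter();
        // Task A: the compromised result handler's deliberate exception.
        waiter.taskFinished(new RuntimeException("DAGSchedulerSuiteDummyException"));
        // Task B: completes after the job has already failed.
        try {
            waiter.taskFinished(new RuntimeException("DAGSchedulerSuiteDummyException"));
        } catch (UnsupportedOperationException e) {
            System.out.println("second task saw: " + e.getMessage());
        }
        // With real task threads, whether A or B arrives first is
        // nondeterministic, hence the intermittent assertion failures.
        System.out.println("job cause: " + waiter.jobCause().getMessage());
    }
}
```

Here the arrival order is fixed for clarity; in the real test the two task threads race, so the recorded cause varies from run to run.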
> The solution, I submit, is to ensure that the job is composed of just one
> task, so that the single task results in the call to the compromised
> ResultHandler, causing the test's deliberate exception to be thrown and
> exercising the relevant (DAGScheduler) code paths. Given that tasks are
> scoped by the number of partitions of an RDD, this could be achieved with a
> single-partition RDD (indeed, doing so would seem to exercise/test some of
> the default-parallelism support of the TaskScheduler?); the pull request
> offered, however, is based on the minimal change of just using a single
> partition of the 2 (or more) partition parallelized RDD. This results in
> scheduling a job of just one task; that one successful task calls the
> user-supplied compromised ResultHandler function, which fails the job and
> unambiguously wraps our DAGSchedulerSuiteDummyException inside a
> SparkDriverExecutionException. There are no other tasks which, on running
> successfully, would find the job already failed and so cause the 'undesired'
> UnsupportedOperationException to be thrown instead. This, then, satisfies
> the test's setup assertion.
>
> I have tested this hypothesis, having parameterised the number of
> partitions, N, used by the "misbehaved ResultHandler" job, and have observed
> the 1 x DAGSchedulerSuiteDummyException first, followed by the legitimate
> N-1 x UnsupportedOperationExceptions... what propagates back from the job
> seems simply to be the outcome of the race between task threads, hence the
> intermittent failures observed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
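The single-partition fix described in the issue can be sketched under the same toy model (plain Java, hypothetical names; the real change is just passing one partition id to sc.runJob rather than both): with exactly one task, only one exception can ever reach the first-failure-wins cause slot, so the recorded cause is unambiguous.

```java
import java.util.concurrent.atomic.AtomicReference;

public class SingleTaskSketch {
    // Hypothetical stand-in for running the test job: one "task" per
    // partition id, each throwing the deliberate dummy exception; only the
    // FIRST failure is recorded as the job's cause (first-failure-wins).
    static Throwable runJob(int[] partitionIds) {
        AtomicReference<Throwable> cause = new AtomicReference<>();
        for (int id : partitionIds) {
            RuntimeException e =
                    new RuntimeException("DAGSchedulerSuiteDummyException");
            cause.compareAndSet(null, e); // later tasks lose the race
        }
        return cause.get();
    }

    public static void main(String[] args) {
        // The pull request's minimal change, by analogy: schedule over a
        // single partition id. With one task there is no race, so the test's
        // setup assertion on the cause cannot fail intermittently.
        System.out.println(runJob(new int[]{0}).getMessage());
    }
}
```

With two or more partition ids the model still records the first task's exception, but in the real multi-threaded test the "first" task is not deterministic; restricting the job to one partition removes that ambiguity entirely.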