[ https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954736#comment-14954736 ]
Dr Stephen A Hellberg commented on SPARK-11066:
-----------------------------------------------

Sean: apologies re: Fix Version and Target Version; I was led astray in interpreting their purpose, given that both were present on the Create Issue template. Fix Version makes complete sense: until the fix is integrated, it's not fixed. Target Version I had interpreted as where I'd hope to see the fix released, i.e. where it is suitable for being applied. I know this issue arises in the 1.4.x releases (and probably earlier), but I'm mostly interested in seeing it addressed in current and future releases; my fix is likely equally sufficient in prior releases, so what criteria are used to decide how far back a committer would backport a fix (given only Affects Versions)?

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler
> occasionally fails due to j.l.UnsupportedOperationException concerning a
> finished JobWaiter
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11066
>                 URL: https://issues.apache.org/jira/browse/SPARK-11066
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core, Tests
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>        Environment: Multiple OS and platform types.
> (Also observed by others, e.g.
> see External URL)
>            Reporter: Dr Stephen A Hellberg
>            Priority: Minor
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an
> inherent problem: it creates a job for the DAGScheduler comprising multiple
> (2) tasks. Whilst the job will fail and a SparkDriverExecutionException will
> be returned, a race condition exists as to which exception becomes its
> cause: either the first task's deliberately thrown
> DAGSchedulerSuiteDummyException (thrown as the setup of the misbehaving
> test), or the DAGScheduler's legitimate UnsupportedOperationException (a
> subclass of RuntimeException), raised when the second (and any subsequent)
> task completes against the already-failed job. This race condition is likely
> associated with the vagaries of processing quanta, and the expense of
> throwing two exceptions (under interpreter execution) per thread of control;
> the race is usually 'won' by the first task throwing the
> DAGSchedulerSuiteDummyException, as desired (and expected)... but not
> always.
>
> The problem for the testcase is that the first assertion largely concerns
> the test setup, and doesn't (can't? Sorry, still not a ScalaTest expert)
> capture all the causes of SparkDriverExecutionException that can
> legitimately arise from a correctly working (not crashed) DAGScheduler.
> Arguably, this assertion might test something of the DAGScheduler... but not
> all the possible outcomes of a working DAGScheduler. Nevertheless, this
> test, when comprising a multiple-task job, will report a failure when in
> fact the DAGScheduler is working as designed (and not crashed ;-).
> Furthermore, the test has already failed before it tries to use the
> SparkContext a second time (for an arbitrary processing task), which I think
> is the real subject of the test?
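The first-failure-wins behaviour behind the race can be sketched with a toy model (plain Java, hypothetical names; this is not Spark's actual JobWaiter): whichever task's exception reaches the waiter first becomes the job's cause, and any later task completion against the already-finished job surfaces as an UnsupportedOperationException instead.

```java
import java.util.concurrent.atomic.AtomicReference;

public class RaceSketch {
    // Hypothetical stand-in for Spark's JobWaiter: records only the FIRST
    // failure; any subsequent completion hits the "already finished" path.
    static class MiniJobWaiter {
        private final AtomicReference<Throwable> cause = new AtomicReference<>();

        void taskFinished(Throwable error) {
            if (!cause.compareAndSet(null, error)) {
                // The job is already failed: reject the late task, as Spark's
                // JobWaiter does for a finished job.
                throw new UnsupportedOperationException(
                        "taskSucceeded() called on a finished JobWaiter");
            }
        }

        Throwable jobCause() { return cause.get(); }
    }

    public static void main(String[] args) {
        MiniJobWaiter waiter = new MiniJobWaiter();
        // Task A: the compromised result handler's deliberate exception.
        waiter.taskFinished(new RuntimeException("DAGSchedulerSuiteDummyException"));
        // Task B: completes after the job has already failed.
        try {
            waiter.taskFinished(new RuntimeException("DAGSchedulerSuiteDummyException"));
        } catch (UnsupportedOperationException e) {
            System.out.println("second task saw: " + e.getMessage());
        }
        // With real task threads, whether A or B arrives first is
        // nondeterministic, hence the intermittent assertion failures.
        System.out.println("job cause: " + waiter.jobCause().getMessage());
    }
}
```

Here the arrival order is fixed for clarity; in the real test the two task threads race, so the recorded cause varies from run to run.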
> The solution, I submit, is to ensure that the job is composed of just one
> task, so that the single task results in the call to the compromised
> ResultHandler, causing the test's deliberate exception to be thrown and
> exercising the relevant (DAGScheduler) code paths. Given that tasks are
> scoped by the number of partitions of an RDD, this could be achieved with a
> single-partition RDD (indeed, doing so would seem to exercise/test some of
> the default-parallelism support of the TaskScheduler?); the pull request
> offered, however, is based on the minimal change of just using a single
> partition of the 2 (or more) partition parallelized RDD. This results in
> scheduling a job of just one task; that one successful task calls the
> user-supplied compromised ResultHandler function, which fails the job and
> unambiguously wraps our DAGSchedulerSuiteDummyException inside a
> SparkDriverExecutionException. There are no other tasks which, on running
> successfully, would find the job already failed and so cause the 'undesired'
> UnsupportedOperationException to be thrown instead. This, then, satisfies
> the test's setup assertion.
>
> I have tested this hypothesis, having parameterised the number of
> partitions, N, used by the "misbehaved ResultHandler" job, and have observed
> the 1 x DAGSchedulerSuiteDummyException first, followed by the legitimate
> N-1 x UnsupportedOperationExceptions... what propagates back from the job
> seems simply to be the outcome of the race between task threads, hence the
> intermittent failures observed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
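The single-partition fix described in the issue can be sketched under the same toy model (plain Java, hypothetical names; the real change is just passing one partition id to sc.runJob rather than both): with exactly one task, only one exception can ever reach the first-failure-wins cause slot, so the recorded cause is unambiguous.

```java
import java.util.concurrent.atomic.AtomicReference;

public class SingleTaskSketch {
    // Hypothetical stand-in for running the test job: one "task" per
    // partition id, each throwing the deliberate dummy exception; only the
    // FIRST failure is recorded as the job's cause (first-failure-wins).
    static Throwable runJob(int[] partitionIds) {
        AtomicReference<Throwable> cause = new AtomicReference<>();
        for (int id : partitionIds) {
            RuntimeException e =
                    new RuntimeException("DAGSchedulerSuiteDummyException");
            cause.compareAndSet(null, e); // later tasks lose the race
        }
        return cause.get();
    }

    public static void main(String[] args) {
        // The pull request's minimal change, by analogy: schedule over a
        // single partition id. With one task there is no race, so the test's
        // setup assertion on the cause cannot fail intermittently.
        System.out.println(runJob(new int[]{0}).getMessage());
    }
}
```

With two or more partition ids the model still records the first task's exception, but in the real multi-threaded test the "first" task is not deterministic; restricting the job to one partition removes that ambiguity entirely.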