[ https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955085#comment-14955085 ]
Dr Stephen A Hellberg commented on SPARK-11066:
-----------------------------------------------

Thanks for the clarification, Sean. And I've given my patches' comments a bit of a haircut... sorry, I probably err on the side of verbosity. (Ahem, some would likely consider that a stylistic failing ;-) ). I've also had a go at getting to grips with the dev/lint-scala tool, applied to the codebase with my proposed (revised) patch, which now passes.

> Flaky test o.a.s.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11066
>                 URL: https://issues.apache.org/jira/browse/SPARK-11066
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core, Tests
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>         Environment: Multiple OS and platform types.
>                      (Also observed by others, e.g. see External URL)
>            Reporter: Dr Stephen A Hellberg
>            Priority: Minor
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent problem: it creates a job for the DAGScheduler comprising multiple (2) tasks. The job will fail and a SparkDriverExecutionException will be returned, but a race condition exists over which exception becomes its cause: either the first task's (deliberately) thrown DAGSchedulerSuiteDummyException, the setup of the misbehaving test, or the DAGScheduler's legitimate UnsupportedOperationException (a subclass of RuntimeException), raised when the second (and subsequent) tasks complete and find the job already finished. This race is likely down to the vagaries of processing quanta and the expense of throwing two exceptions (under interpreter execution) per thread of control; it is usually 'won' by the first task throwing the DAGSchedulerSuiteDummyException, as desired (and expected)... but not always.
>
> The problem for the testcase is that its first assertion largely concerns the test setup, and doesn't (can't? Sorry, still not a ScalaTest expert) capture all the causes of SparkDriverExecutionException that can legitimately arise from a correctly working (not crashed) DAGScheduler. Arguably, that assertion tests something of the DAGScheduler... but not all the possible outcomes for a working DAGScheduler. Consequently this test - when comprising a multi-task job - will report a failure even when the DAGScheduler is working as designed (and not crashed ;-). Furthermore, the test has already failed before it tries to use the SparkContext a second time (for an arbitrary processing task), which I think is the real subject of the test?
>
> The solution, I submit, is to ensure that the job is composed of just one task; that single task's success will lead to the call to the compromised ResultHandler, causing the test's deliberate exception to be thrown and exercising the relevant (DAGScheduler) code paths.
> Given tasks are scoped by the number of partitions of an RDD, this could be achieved with a single-partition RDD (indeed, doing so would also seem to exercise some of the TaskScheduler's default-parallelism support?); the pull request offered, however, makes the minimal change of running the job over just a single partition of the 2 (or more) partition parallelized RDD. This schedules a job of just one task; that one successful task calls the user-supplied, compromised ResultHandler function, which fails the job and unambiguously wraps our DAGSchedulerSuiteDummyException inside a SparkDriverExecutionException. There are then no other tasks that, on completing successfully, would find the job already failed and cause the 'undesired' UnsupportedOperationException to be thrown instead. This satisfies the test's setup assertion (sketches of the racy shape and the proposed change follow below).
>
> I have tested this hypothesis by parameterising the number of partitions, N, used by the "misbehaved ResultHandler" job, and observed 1 x DAGSchedulerSuiteDummyException first, followed by the legitimate N-1 x UnsupportedOperationExceptions... what propagates back from the job is simply whatever wins the race between task threads, hence the intermittent failures observed.
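To make the race concrete, here is a minimal sketch of the two-task shape of the test. It is hand-written for illustration rather than being the suite's exact source, and it assumes the suite's existing DAGSchedulerSuiteDummyException helper, ScalaTest's intercept, and a live SparkContext sc:

{code:scala}
// Illustrative sketch: a 2-partition RDD yields 2 tasks, and the
// user-supplied result handler is invoked once per completed task.
val e = intercept[SparkDriverExecutionException] {
  val rdd = sc.parallelize(1 to 10, 2)   // 2 partitions => 2 tasks
  sc.runJob[Int, Int](
    rdd,
    (context: TaskContext, iter: Iterator[Int]) => iter.size,
    Seq(0, 1),                           // run both partitions
    (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
}
// Racy: if the first task's handler throws first, the job's cause is the
// dummy exception; but a task completing after the job has already failed
// trips the finished JobWaiter, surfacing UnsupportedOperationException
// as the cause instead - so this assertion fails intermittently.
assert(e.getCause.isInstanceOf[DAGSchedulerSuiteDummyException])
{code}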
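And a sketch of the minimal change proposed: keep the 2-partition RDD, but name only a single partition so exactly one task is scheduled (again illustrative; the patch as offered may differ in detail):

{code:scala}
val e = intercept[SparkDriverExecutionException] {
  val rdd = sc.parallelize(1 to 10, 2)
  sc.runJob[Int, Int](
    rdd,
    (context: TaskContext, iter: Iterator[Int]) => iter.size,
    Seq(0),   // a single partition => exactly one task, so nothing to race
    (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
}
// Deterministic: the sole task's handler throws, so the job's cause is
// always the dummy exception and the setup assertion holds.
assert(e.getCause.isInstanceOf[DAGSchedulerSuiteDummyException])

// The real subject of the test: the SparkContext must still be usable.
assert(sc.parallelize(1 to 10, 2).count() === 10)
{code}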