[
https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953241#comment-14953241
]
Apache Spark commented on SPARK-11066:
--------------------------------------
User 'shellberg' has created a pull request for this issue:
https://github.com/apache/spark/pull/9076
> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler
> occasionally fails due to j.l.UnsupportedOperationException concerning a
> finished JobWaiter
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-11066
> URL: https://issues.apache.org/jira/browse/SPARK-11066
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: Multiple OS and platform types.
> (Also observed by others, e.g. see External URL)
> Reporter: Dr Stephen A Hellberg
> Priority: Minor
> Fix For: 1.5.1, 1.6.0
>
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent
> problem: it creates a job for the DAGScheduler comprising multiple (2) tasks.
> Whilst the job will always fail and a SparkDriverExecutionException will be
> returned, a race condition determines which exception becomes the recorded
> cause: either the first task's (deliberately) thrown exception fails the job,
> setting the cause to the DAGSchedulerSuiteDummyException thrown in the test's
> setup, or a second (or subsequent) task completes against an already-finished
> JobWaiter, and the DAGScheduler's legitimate UnsupportedOperationException (a
> subclass of RuntimeException) is returned as the cause instead. This race is
> likely governed by the vagaries of processing quanta and the expense of
> throwing two exceptions (under interpreter execution) per thread of control;
> it is usually 'won' by the first task throwing the
> DAGSchedulerSuiteDummyException, as desired (and expected)... but not always.
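> For reference, the test has roughly the following shape (a sketch, not the
> verbatim suite code; I'm assuming the SparkContext.runJob overload that takes
> an explicit partition list, and the suite's DAGSchedulerSuiteDummyException):
> {code}
> val e = intercept[SparkDriverExecutionException] {
>   // A 2-partition RDD yields a 2-task job, hence the race described above.
>   val rdd = sc.parallelize(1 to 10, 2)
>   sc.runJob[Int, Int](
>     rdd,
>     (context: TaskContext, iter: Iterator[Int]) => iter.size,
>     Seq(0, 1),  // both partitions => two tasks
>     // The 'misbehaved' user-supplied ResultHandler: throws on any result.
>     (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
> }
> assert(e.getCause.isInstanceOf[DAGSchedulerSuiteDummyException])
> {code}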
> The problem for the testcase is that its first assertion largely concerns the
> test setup, and doesn't (can't? Sorry, still not a ScalaTest expert) capture
> all the causes of SparkDriverExecutionException that can legitimately arise
> from a correctly working (not crashed) DAGScheduler. Arguably, this assertion
> does test something of the DAGScheduler... but not all the possible outcomes
> of a working DAGScheduler. Consequently, this test - when comprising a
> multiple task job - can report a failure when in fact the DAGScheduler is
> working as designed (and not crashed ;-). Furthermore, the test has already
> failed before it tries to use the SparkContext a second time (for an
> arbitrary processing task), which I think is the real subject of the test.
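> To make the ambiguity concrete: an assertion that accepted every cause a
> working DAGScheduler can legitimately produce here would have to look
> something like the following sketch (illustration only, not the fix the pull
> request takes):
> {code}
> // Either cause can arise from a healthy DAGScheduler, depending on which
> // task thread 'wins' the race.
> assert(e.getCause.isInstanceOf[DAGSchedulerSuiteDummyException] ||
>   e.getCause.isInstanceOf[UnsupportedOperationException])
> {code}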
> The solution, I submit, is to ensure that the job is composed of just one
> task, so that the single task's completion triggers the call to the
> compromised ResultHandler, throws the test's deliberate exception, and
> exercises the relevant (DAGScheduler) code paths. Given that tasks are scoped
> by the number of partitions of an RDD, this could be achieved with a
> single-partition RDD (indeed, doing so would also seem to exercise some
> default parallelism support of the TaskScheduler?); the pull request offered,
> however, makes the minimal change of running the job over just a single
> partition of the 2 (or more) partition parallelized RDD, as sketched below.
> This schedules a job of just one task; that one successful task calls the
> user-supplied compromised ResultHandler function, which fails the job and
> unambiguously wraps our DAGSchedulerSuiteDummyException inside a
> SparkDriverExecutionException. There are then no other tasks that, on
> completing successfully, would find the job already failed and cause the
> 'undesired' UnsupportedOperationException to be thrown instead. This, then,
> satisfies the test's setup assertion.
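> Concretely, the change amounts to narrowing the partitions argument of the
> runJob call (again a sketch, assuming the call shape shown earlier):
> {code}
> sc.runJob[Int, Int](
>   rdd,  // still the 2 (or more) partition parallelized RDD
>   (context: TaskContext, iter: Iterator[Int]) => iter.size,
>   Seq(0),  // schedule only partition 0 => a single-task job, no race
>   (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
> {code}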
> I have tested this hypothesis by parameterising the number of partitions, N,
> used by the "misbehaved ResultHandler" job, and have observed the 1 x
> DAGSchedulerSuiteDummyException first, followed by the legitimate (N - 1) x
> UnsupportedOperationExceptions... whichever exception propagates back from
> the job is simply the outcome of the race between task threads, which
> accounts for the intermittent failures observed.
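> The parameterised experiment looked roughly like this (hypothetical names and
> values; numPartitions plays the role of N):
> {code}
> val numPartitions = 8  // N, varied across runs
> val rdd = sc.parallelize(1 to 100, numPartitions)
> val e = intercept[SparkDriverExecutionException] {
>   sc.runJob[Int, Int](
>     rdd,
>     (context: TaskContext, iter: Iterator[Int]) => iter.size,
>     0 until numPartitions,  // N partitions => N tasks
>     (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
> }
> // Observed: 1 x DAGSchedulerSuiteDummyException, then (N - 1) x
> // UnsupportedOperationException from the already-finished JobWaiter;
> // which one surfaces as e.getCause depends on the thread race.
> {code}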