Dr Stephen A Hellberg created SPARK-11066:
---------------------------------------------
Summary: Flaky test
o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due
to j.l.UnsupportedOperationException concerning a finished JobWaiter
Key: SPARK-11066
URL: https://issues.apache.org/jira/browse/SPARK-11066
Project: Spark
Issue Type: Bug
Components: Scheduler, Spark Core
Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0
Environment: Multiple OS and platform types.
(Also observed by others, e.g. see External URL)
Reporter: Dr Stephen A Hellberg
Priority: Minor
Fix For: 1.6.0, 1.5.1
The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent
problem: it creates a job for the DAGScheduler comprising two tasks. The job
will fail and a SparkDriverExecutionException will be returned, but a race
condition exists over which exception becomes its cause. If the first task's
(deliberately) thrown exception fails the job, the cause is set to the
DAGSchedulerSuiteDummyException thrown as part of the misbehaving test's setup;
but the second (and any subsequent) task, completing after the job has already
failed, instead triggers the DAGScheduler's legitimate
UnsupportedOperationException (a subclass of RuntimeException) concerning a
finished JobWaiter, and that can be returned as the cause instead. Which
exception wins likely depends on the vagaries of processing quanta and the
expense of throwing two exceptions (under interpreter execution) per thread of
control; the race is usually 'won' by the first task throwing the
DAGSchedulerSuiteDummyException, as desired (and expected)... but not always.
The problem for the testcase is that its first assertion largely concerns the
test setup, yet it doesn't (can't? Sorry, still not a ScalaTest expert) capture
all the causes of SparkDriverExecutionException that can legitimately arise
from a correctly working (not crashed) DAGScheduler. Arguably, this assertion
tests something of the DAGScheduler... but not all the possible outcomes of a
working DAGScheduler. Consequently, when the job comprises multiple tasks, the
test can report a failure even though the DAGScheduler is working-as-designed
(and not crashed ;-). Furthermore, the test has already failed before it tries
to use the SparkContext a second time (for an arbitrary processing task), which
I think is the real subject of the test?
The solution, I submit, is to ensure that the job comprises just one task, so
that the single task's completion calls the compromised ResultHandler, throws
the test's deliberate exception, and exercises the relevant (DAGScheduler) code
paths. Given that tasks are scoped by the number of partitions of an RDD, this
could be achieved with a single-partition RDD (indeed, doing so seems it would
also exercise some default-parallelism support of the TaskScheduler?); the pull
request offered, however, makes the minimal change of running the job over just
a single partition of the existing two-(or more-)partition parallelized RDD.
This schedules a job of exactly one task: that one successful task calls the
user-supplied compromised ResultHandler function, which fails the job and
unambiguously wraps our DAGSchedulerSuiteDummyException inside a
SparkDriverExecutionException. There are no other tasks that, on completing
successfully, would find the job already failed and cause the 'undesired'
UnsupportedOperationException to be thrown instead. This, then, satisfies the
test's setup assertion.
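The first-exception-wins behaviour described above, and why a single-task job
sidesteps the race, can be sketched with a minimal Python analogue (Spark's
actual JobWaiter is Scala and is driven concurrently; the class and method
names below are illustrative, not Spark's API, and the two "task completions"
are run sequentially here to show each outcome deterministically):

```python
class DummyException(Exception):
    """Stands in for the test's DAGSchedulerSuiteDummyException."""

class JobWaiter:
    """Toy analogue of Spark's JobWaiter: the first exception fails the job."""
    def __init__(self, total_tasks, result_handler):
        self.total_tasks = total_tasks
        self.result_handler = result_handler
        self.finished = False
        self.cause = None

    def task_succeeded(self, index, result):
        if self.finished:
            # A task completing after the job has already ended hits this path,
            # mirroring "taskSucceeded() called on a finished JobWaiter".
            raise RuntimeError("taskSucceeded() called on a finished JobWaiter")
        try:
            self.result_handler(index, result)
        except Exception as exc:
            # The first exception to arrive fails the job and becomes the cause.
            self.finished = True
            self.cause = exc
            raise

def misbehaved_handler(index, result):
    raise DummyException("deliberate failure from the ResultHandler")

# Two-task job: the second completion finds the job already failed.
waiter = JobWaiter(total_tasks=2, result_handler=misbehaved_handler)
try:
    waiter.task_succeeded(0, "result-0")   # fails the job with DummyException
except DummyException:
    pass
try:
    waiter.task_succeeded(1, "result-1")   # job already finished
except RuntimeError as exc:
    second_task_error = exc                # the 'undesired' exception

# One-task job: only the DummyException can ever be the recorded cause.
single = JobWaiter(total_tasks=1, result_handler=misbehaved_handler)
try:
    single.task_succeeded(0, "result-0")
except DummyException:
    pass

print(type(waiter.cause).__name__, type(single.cause).__name__)
# → DummyException DummyException
```

Under real concurrency the two task threads of the two-task job race, so either
exception may propagate back first; with a single task there is no later
completion, and the cause is unambiguous.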
I have tested this hypothesis by parameterising the number of partitions, N,
used by the "misbehaved ResultHandler" job, and observed the 1 x
DAGSchedulerSuiteDummyException first, followed by the legitimate (N-1) x
UnsupportedOperationExceptions. Whatever propagates back from the job appears
to be simply the result of the race between task threads, which accounts for
the intermittent failures observed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]