Dr Stephen A Hellberg created SPARK-11066:
---------------------------------------------
Summary: Flaky test
o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due
to j.l.UnsupportedOperationException concerning a finished JobWaiter
Key: SPARK-11066
URL: https://issues.apache.org/jira/browse/SPARK-11066
Project: Spark
Issue Type: Bug
Components: Scheduler, Spark Core
Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0
Environment: Multiple OS and platform types.
(Also observed by others, e.g. see External URL)
Reporter: Dr Stephen A Hellberg
Priority: Minor
Fix For: 1.6.0, 1.5.1
The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent
problem: it creates a job for the DAGScheduler comprising two tasks. The job
will fail and a SparkDriverExecutionException will be returned, but a race
condition exists over which exception becomes its cause. If the first task's
(deliberately) thrown exception fails the job, the cause is set to the
DAGSchedulerSuiteDummyException thrown as part of the misbehaving test's setup;
but the second (and any subsequent) task, completing after the job has already
failed, instead triggers the DAGScheduler's legitimate
UnsupportedOperationException (a subclass of RuntimeException) concerning a
finished JobWaiter, and that can be returned as the cause instead. Which
exception wins likely depends on the vagaries of processing quanta and the
expense of throwing two exceptions (under interpreter execution) per thread of
control; the race is usually 'won' by the first task throwing the
DAGSchedulerSuiteDummyException, as desired (and expected)... but not always.
The problem for the testcase is that its first assertion largely concerns the
test setup, yet it doesn't (can't? Sorry, still not a ScalaTest expert) capture
all the causes of SparkDriverExecutionException that can legitimately arise
from a correctly working (not crashed) DAGScheduler. Arguably, this assertion
tests something of the DAGScheduler... but not all the possible outcomes of a
working DAGScheduler. Consequently, when the job comprises multiple tasks, the
test can report a failure even though the DAGScheduler is working-as-designed
(and not crashed ;-). Furthermore, the test has already failed before it tries
to use the SparkContext a second time (for an arbitrary processing task), which
I think is the real subject of the test?
The solution, I submit, is to ensure that the job comprises just one task, so
that the single task's completion calls the compromised ResultHandler, throws
the test's deliberate exception, and exercises the relevant (DAGScheduler) code
paths. Given that tasks are scoped by the number of partitions of an RDD, this
could be achieved with a single-partition RDD (indeed, doing so seems it would
also exercise some default-parallelism support of the TaskScheduler?); the pull
request offered, however, makes the minimal change of running the job over just
a single partition of the existing two-(or more-)partition parallelized RDD.
This schedules a job of exactly one task: that one successful task calls the
user-supplied compromised ResultHandler function, which fails the job and
unambiguously wraps our DAGSchedulerSuiteDummyException inside a
SparkDriverExecutionException. There are no other tasks that, on completing
successfully, would find the job already failed and cause the 'undesired'
UnsupportedOperationException to be thrown instead. This, then, satisfies the
test's setup assertion.
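The first-exception-wins behaviour described above, and why a single-task job
sidesteps the race, can be sketched with a minimal Python analogue (Spark's
actual JobWaiter is Scala and is driven concurrently; the class and method
names below are illustrative, not Spark's API, and the two "task completions"
are run sequentially here to show each outcome deterministically):

```python
class DummyException(Exception):
    """Stands in for the test's DAGSchedulerSuiteDummyException."""

class JobWaiter:
    """Toy analogue of Spark's JobWaiter: the first exception fails the job."""
    def __init__(self, total_tasks, result_handler):
        self.total_tasks = total_tasks
        self.result_handler = result_handler
        self.finished = False
        self.cause = None

    def task_succeeded(self, index, result):
        if self.finished:
            # A task completing after the job has already ended hits this path,
            # mirroring "taskSucceeded() called on a finished JobWaiter".
            raise RuntimeError("taskSucceeded() called on a finished JobWaiter")
        try:
            self.result_handler(index, result)
        except Exception as exc:
            # The first exception to arrive fails the job and becomes the cause.
            self.finished = True
            self.cause = exc
            raise

def misbehaved_handler(index, result):
    raise DummyException("deliberate failure from the ResultHandler")

# Two-task job: the second completion finds the job already failed.
waiter = JobWaiter(total_tasks=2, result_handler=misbehaved_handler)
try:
    waiter.task_succeeded(0, "result-0")   # fails the job with DummyException
except DummyException:
    pass
try:
    waiter.task_succeeded(1, "result-1")   # job already finished
except RuntimeError as exc:
    second_task_error = exc                # the 'undesired' exception

# One-task job: only the DummyException can ever be the recorded cause.
single = JobWaiter(total_tasks=1, result_handler=misbehaved_handler)
try:
    single.task_succeeded(0, "result-0")
except DummyException:
    pass

print(type(waiter.cause).__name__, type(single.cause).__name__)
# → DummyException DummyException
```

Under real concurrency the two task threads of the two-task job race, so either
exception may propagate back first; with a single task there is no later
completion, and the cause is unambiguous.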
I have tested this hypothesis by parameterising the number of partitions, N,
used by the "misbehaved ResultHandler" job, and observed the 1 x
DAGSchedulerSuiteDummyException first, followed by the legitimate (N-1) x
UnsupportedOperationExceptions. Whatever propagates back from the job appears
to be simply the result of the race between task threads, which accounts for
the intermittent failures observed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]