Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/6352#discussion_r31065709
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -342,6 +342,35 @@ class DAGSchedulerSuite
assert(locs === Seq(Seq("hostA", "hostB"), Seq("hostB", "hostC"),
Seq("hostC", "hostD")))
}
+  /**
+   * +---+ shuffle +---+    +---+
+   * | A |<--------| B |<---| C |<--+
+   * +---+         +---+    +---+   |  +---+
+   *                                +--| E |
+   *                        +---+   |  +---+
+   *                        | D |<--+
+   *                        +---+
+   * Here, E has one-to-one dependencies on C and D. C is derived from A by performing a shuffle
+   * and then a map. If we're trying to determine which ancestor stages need to be computed in
+   * order to compute E, we need to figure out whether the shuffle A -> B should be performed.
+   * If the RDD C, which has only one ancestor via a narrow dependency, is cached, then we won't
+   * need to compute A, even if it has some unavailable output partitions. The same goes for B:
+   * if B is 100% cached, then we can avoid the shuffle on A.
+   */
+ test("SPARK-7826: regression test for getMissingParentStages") {
--- End diff --
Good call on the test naming change. This test was derived from [a
comment](https://github.com/apache/spark/pull/6352#issuecomment-104847514) that
I left upthread, which described a case where the logic in an earlier version
of this patch was incorrect.

That earlier version skipped the check for whether `C` was cached, so the
scheduler could mistakenly conclude that the shuffle from `A` to `B` needed to
be performed. It still passed all existing tests despite the logic error, so
we added this test to exercise that case.
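
To make the failure mode concrete, here is a minimal sketch of the traversal in
plain Scala. It is a toy model, not the actual `DAGScheduler` code: `Node`,
`Narrow`, `Shuffle`, `missingShuffleParents`, and `graph` are hypothetical names
made up for illustration. The point it shows is that the cache check has to be
applied at every RDD visited on the walk back from `E`, so that a cached `C`
(or `B`) stops the walk before it reaches the shuffle on `A`.

```scala
object StageSketch {
  // Hypothetical toy types for illustration; these are not Spark's classes.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node, outputAvailable: Boolean) extends Dep
  final case class Node(name: String, deps: List[Dep], fullyCached: Boolean = false)

  /** Names of shuffle inputs that must still be computed before `target` can run. */
  def missingShuffleParents(target: Node): Set[String] = {
    var missing = Set.empty[String]
    var visited = Set.empty[String]
    var toVisit = List(target)
    while (toVisit.nonEmpty) {
      val node = toVisit.head
      toVisit = toVisit.tail
      if (!visited(node.name)) {
        visited += node.name
        // Check the cache at every RDD on the walk, not only at the starting one.
        if (!node.fullyCached) {
          node.deps.foreach {
            case Narrow(parent) => toVisit = parent :: toVisit
            case Shuffle(parent, available) => if (!available) missing += parent.name
          }
        }
      }
    }
    missing
  }

  /** Builds the A..E graph from the test comment; A's shuffle output is unavailable. */
  def graph(bCached: Boolean, cCached: Boolean): Node = {
    val a = Node("A", Nil)
    val b = Node("B", List(Shuffle(a, outputAvailable = false)), bCached)
    val c = Node("C", List(Narrow(b)), cCached)
    val d = Node("D", Nil)
    Node("E", List(Narrow(c), Narrow(d)))
  }
}
```

With nothing cached, `StageSketch.missingShuffleParents(StageSketch.graph(bCached = false, cCached = false))`
reports `A`; caching either `B` or `C` gives an empty set, which is the
behavior the regression test is meant to pin down.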