Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6352#discussion_r31166517
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -342,6 +342,29 @@ class DAGSchedulerSuite
         assert(locs === Seq(Seq("hostA", "hostB"), Seq("hostB", "hostC"), Seq("hostC", "hostD")))
       }
     
    +  /**
    +   * +---+ shuffle +---+    +---+    +---+
    +   * | A |<--------| B |<---| C |<---| D |
    +   * +---+         +---+    +---+    +---+
    +   * Here, D has one-to-one dependencies on C. C is derived from A by performing a shuffle
    +   * and then a map. If we're trying to determine which ancestor stages need to be computed in
    +   * order to compute D, we need to figure out whether the shuffle A -> B should be performed.
    +   * If the RDD C, which has only one ancestor via a narrow dependency, is cached, then we won't
    +   * need to compute A, even if it has some unavailable output partitions. The same goes for B:
    +   * if B is 100% cached, then we can avoid the shuffle on A.
    --- End diff --
    
    Josh's comment was an awesome description of how the dependencies should be computed, but it isn't quite appropriate here as the comment for the test. What about something like:
    
    This test ensures that if a particular RDD is cached, RDDs earlier in the dependency chain are not computed. It constructs the following chain of dependencies:
    +---+ shuffle +---+    +---+    +---+
    | A |<--------| B |<---| C |<---| D |
    +---+         +---+    +---+    +---+
    Here, B is derived from A by performing a shuffle, C has a one-to-one dependency on B, and D similarly has a one-to-one dependency on C. If none of the RDDs were cached, this set of RDDs would result in a two-stage job: one ShuffleMapStage, and a ResultStage that reads the shuffled data from RDD A. This test ensures that if C is cached, the scheduler doesn't perform a shuffle, and instead computes the result using a single ResultStage that reads C's cached data.
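
    To make the expected behavior concrete, here is a rough, self-contained sketch of the rule the test checks. It is not the DAGScheduler's code and does not use the suite's helpers: `Node`, `Narrow`, `Shuffle`, `requiredShuffles`, `stageCount` and `chain` are all hypothetical names made up for illustration, and the sketch assumes that no shuffle output is already available.

```scala
object CachedLineageSketch {
  // Hypothetical stand-ins for RDDs and their dependencies; not Spark's API.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil, cached: Boolean = false)

  /** Names of the RDDs whose shuffle output must be produced to compute `rdd`.
    * Assumption: no shuffle output exists yet, so every shuffle that is
    * reachable without passing through a cached RDD has to run. */
  def requiredShuffles(rdd: Node): List[String] =
    if (rdd.cached) Nil // a cached RDD hides everything earlier in the chain
    else rdd.deps.flatMap {
      case Narrow(p)  => requiredShuffles(p)
      case Shuffle(p) => p.name :: requiredShuffles(p)
    }

  /** One result stage plus one map stage per required shuffle. */
  def stageCount(rdd: Node): Int = 1 + requiredShuffles(rdd).size

  /** Builds A <-shuffle- B <- C <- D and returns D, optionally with C cached. */
  def chain(cacheC: Boolean): Node = {
    val a = Node("A")
    val b = Node("B", List(Shuffle(a)))
    val c = Node("C", List(Narrow(b)), cached = cacheC)
    Node("D", List(Narrow(c)))
  }

  def main(args: Array[String]): Unit = {
    println(stageCount(chain(cacheC = false))) // 2: shuffle map stage + result stage
    println(stageCount(chain(cacheC = true)))  // 1: result stage only
  }
}
```

    With C uncached the walk reaches the shuffle on A, giving two stages; with C cached the walk stops at C, so only the result stage remains, which is what the test should assert.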

