[GitHub] spark pull request: [SPARK-7826][CORE] Suppress extra calling getC...

kayousterhout Tue, 26 May 2015 19:55:33 -0700

Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6352#discussion_r31099674
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -342,6 +342,35 @@ class DAGSchedulerSuite
         assert(locs === Seq(Seq("hostA", "hostB"), Seq("hostB", "hostC"), Seq("hostC", "hostD")))
       }
     
    +  /**
    +   * +---+ shuffle +---+    +---+
    +   * | A |<--------| B |<---| C |<--+
    +   * +---+         +---+    +---+   |  +---+
    +   *                                +--| E |
    +   *                        +---+   |  +---+
    +   *                        | D |<--+
    +   *                        +---+
    +   * Here, E has one-to-one dependencies on C and D. C is derived from A by performing a shuffle
    +   * and then a map. If we're trying to determine which ancestor stages need to be computed in
    +   * order to compute E, we need to figure out whether the shuffle A -> B should be performed.
    +   * If the RDD C, which has only one ancestor via a narrow dependency, is cached, then we won't
    +   * need to compute A, even if it has some unavailable output partitions. The same goes for B:
    +   * if B is 100% cached, then we can avoid the shuffle on A.
    +   */
    +  test("SPARK-7826: regression test for getMissingParentStages") {
    +    val rddA = new MyRDD(sc, 1, Nil)
    +    val rddB = new MyRDD(sc, 1, List(new ShuffleDependency(rddA, null)))
    +    val rddC = new MyRDD(sc, 1, List(new OneToOneDependency(rddB))).cache()
    +    val rddD = new MyRDD(sc, 1, Nil)
    +    val rddE = new MyRDD(sc, 1,
    +      List(new OneToOneDependency(rddC), new OneToOneDependency(rddD)))
    +    cacheLocations(rddC.id -> 0) =
    +      Seq(makeBlockManagerId("hostA"), makeBlockManagerId("hostB"))
    +    val jobId = submit(rddE, Array(0))
    +    val finalStage = scheduler.jobIdToActiveJob(jobId).finalStage
    +    assert(scheduler.getMissingParentStages(finalStage).size === 0)
    --- End diff --
    
    I was thinking you could inspect the contents of the stages in runningStages to make sure the ID is correct.
    
    
    > On May 26, 2015, at 7:53 PM, Takuya UESHIN <[email protected]> wrote:
    > 
    > In core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
    > 
    > > +   * If the RDD C, which has only one ancestor via a narrow dependency, is cached, then we won't
    > > +   * need to compute A, even if it has some unavailable output partitions. The same goes for B:
    > > +   * if B is 100% cached, then we can avoid the shuffle on A.
    > > +   */
    > > +  test("SPARK-7826: regression test for getMissingParentStages") {
    > > +    val rddA = new MyRDD(sc, 1, Nil)
    > > +    val rddB = new MyRDD(sc, 1, List(new ShuffleDependency(rddA, null)))
    > > +    val rddC = new MyRDD(sc, 1, List(new OneToOneDependency(rddB))).cache()
    > > +    val rddD = new MyRDD(sc, 1, Nil)
    > > +    val rddE = new MyRDD(sc, 1,
    > > +      List(new OneToOneDependency(rddC), new OneToOneDependency(rddD)))
    > > +    cacheLocations(rddC.id -> 0) =
    > > +      Seq(makeBlockManagerId("hostA"), makeBlockManagerId("hostB"))
    > > +    val jobId = submit(rddE, Array(0))
    > > +    val finalStage = scheduler.jobIdToActiveJob(jobId).finalStage
    > > +    assert(scheduler.getMissingParentStages(finalStage).size === 0)
    > Ah, I found that checking only that DAGScheduler.runningStages contains one stage is not enough, because it also contains one stage (the one including A) if C is not cached yet.
    > I think we should also check the size of the final stage's missing parents.
    >
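
For readers following the thread, the rule described in the test's doc comment can be sketched without Spark. This is a simplified model under stated assumptions, not the DAGScheduler implementation: `Node`, `Dep`, `Narrow`, `Shuffle`, `MissingParents`, and `Example` are hypothetical names, `cached` stands for "every partition is cached", and shuffle output is assumed to be never available.

```scala
// Hypothetical stand-ins for an RDD lineage graph; not Spark classes.
sealed trait Dep { def parent: Node }
final case class Narrow(parent: Node) extends Dep
final case class Shuffle(parent: Node) extends Dep

// `cached = true` models an RDD whose partitions are all cached.
final case class Node(name: String, deps: List[Dep], cached: Boolean = false)

object MissingParents {
  // Walk from `root` through narrow dependencies. A cached node ends the walk
  // on that branch, so shuffles behind it are never reported. Every shuffle
  // dependency that is reached contributes its parent as a missing stage.
  def missingShuffleParents(root: Node): Set[String] = {
    val visited = scala.collection.mutable.Set.empty[String]
    val missing = scala.collection.mutable.Set.empty[String]
    var pending = List(root)
    while (pending.nonEmpty) {
      val node = pending.head
      pending = pending.tail
      if (visited.add(node.name) && !node.cached) {
        node.deps.foreach {
          case Shuffle(p) => missing += p.name
          case Narrow(p)  => pending = p :: pending
        }
      }
    }
    missing.toSet
  }
}

// The graph from the doc comment: A <-shuffle- B <- C <- E, and D <- E.
object Example {
  val a = Node("A", Nil)
  val b = Node("B", List(Shuffle(a)))
  val d = Node("D", Nil)

  def e(cCached: Boolean): Node = {
    val c = Node("C", List(Narrow(b)), cached = cCached)
    Node("E", List(Narrow(c), Narrow(d)))
  }
}
```

In this model, `MissingParents.missingShuffleParents(Example.e(cCached = true))` is empty, which corresponds to the assertion on `getMissingParentStages` in the test, while `Example.e(cCached = false)` reports the stage containing A.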



