[GitHub] [spark] JoshRosen commented on a change in pull request #34265: [SPARK-23626][CORE] Eagerly compute RDD.partitions on entire DAG when submitting job to DAGScheduler

GitBox Wed, 13 Oct 2021 18:20:04 -0700


JoshRosen commented on a change in pull request #34265:
URL: https://github.com/apache/spark/pull/34265#discussion_r728557204




##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -732,6 +732,35 @@ private[spark] class DAGScheduler(
     missing.toList
   }
 
+  /** Invoke `.partitions` on the given RDD and all of its ancestors  */
+  private def eagerlyComputePartitionsForRddAndAncestors(rdd: RDD[_]): Unit = {
+    val startTime = System.nanoTime
+    val visitedRdds = new HashSet[RDD[_]]
+    // We are manually maintaining a stack here to prevent StackOverflowError
+    // caused by recursively visiting
+    val waitingForVisit = new ListBuffer[RDD[_]]
+    waitingForVisit += rdd
+
+    def visit(rdd: RDD[_]): Unit = {
+      if (!visitedRdds(rdd)) {
+        visitedRdds += rdd
+
+        // Eagerly compute:
+        rdd.partitions

Review comment:
       I don't think so:
   
   Per the "Correctness: proving that we make no excess .partitions calls" 
section of the PR description, I believe that the `DAGScheduler` will 
eventually call `.partitions` on every RDD in the DAG.
   
   If we only call `.partitions` on a subset of the RDDs encountered during our 
DAG traversal here, then we run the risk that there could be an RDD whose 
partitions haven't been evaluated before the `DAGScheduler` calls 
`.partitions`, potentially leaving us vulnerable to the same performance 
problem.
   
   It _is_ true that many implementations of `getPartitions()` call 
`.partitions` on their parent RDDs, but there's no contract which _guarantees_ 
that in all cases. I think it's fine if we call `.partitions` on something 
which would have already been computed, though: it'll just [return the stored 
value](https://github.com/apache/spark/blob/5ac76d9cb45d58eeb4253d50e90060a68c3e87cb/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L289-L300).
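
   To illustrate the point, here is a minimal, self-contained sketch. It does not use Spark: `ToyRdd`, `computeCount`, and `EagerPartitions.eagerlyComputePartitions` are hypothetical names invented for this example, and the memoization only approximates what the linked `RDD.partitions` code does. `getPartitions` deliberately never touches its parents, modelling an implementation that gives no guarantee about evaluating its ancestors.

```scala
import scala.collection.mutable

// Toy stand-in for an RDD. This is NOT Spark's class; every name is hypothetical.
class ToyRdd(val name: String, val parents: Seq[ToyRdd]) {
  // Counts how often the (potentially expensive) computation actually runs.
  var computeCount: Int = 0

  // Stored result, filled in on first access.
  private var cached: Option[Array[Int]] = None

  // Does not call `.partitions` on the parents.
  protected def getPartitions: Array[Int] = {
    computeCount += 1
    Array(0, 1, 2)
  }

  // Memoized accessor: later calls return the stored value.
  final def partitions: Array[Int] = cached.getOrElse {
    val computed = getPartitions
    cached = Some(computed)
    computed
  }
}

object EagerPartitions {
  // Calls `.partitions` on `root` and on every ancestor, visiting each node
  // once and using an explicit work list instead of recursion.
  def eagerlyComputePartitions(root: ToyRdd): Unit = {
    val visited = mutable.HashSet.empty[ToyRdd]
    var waiting: List[ToyRdd] = List(root)
    while (waiting.nonEmpty) {
      val rdd = waiting.head
      waiting = waiting.tail
      if (visited.add(rdd)) { // add returns false if rdd was already visited
        rdd.partitions
        waiting = rdd.parents.toList ::: waiting
      }
    }
  }
}
```

   With a diamond-shaped DAG (`a` feeding `b` and `c`, both feeding `d`), the traversal evaluates each node once, and a later `d.partitions` call only returns the stored value. In this sketch, relying on `d.partitions` alone would leave `a`, `b`, and `c` unevaluated, which is the risk described above.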
 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


