JoshRosen commented on a change in pull request #34265:
URL: https://github.com/apache/spark/pull/34265#discussion_r728615460



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -732,6 +732,35 @@ private[spark] class DAGScheduler(
     missing.toList
   }
 
+  /** Invoke `.partitions` on the given RDD and all of its ancestors  */
+  private def eagerlyComputePartitionsForRddAndAncestors(rdd: RDD[_]): Unit = {
+    val startTime = System.nanoTime
+    val visitedRdds = new HashSet[RDD[_]]
+    // We are manually maintaining a stack here to prevent StackOverflowError
+    // caused by recursively visiting
+    val waitingForVisit = new ListBuffer[RDD[_]]
+    waitingForVisit += rdd
+
+    def visit(rdd: RDD[_]): Unit = {
+      if (!visitedRdds(rdd)) {
+        visitedRdds += rdd
+
+        // Eagerly compute:
+        rdd.partitions
+
+        for (dep <- rdd.dependencies) {
+          waitingForVisit.prepend(dep.rdd)
+        }
+      }
+    }
+
+    while (waitingForVisit.nonEmpty) {
+      visit(waitingForVisit.remove(0))
+    }
+    logDebug("eagerlyComputePartitionsForRddAndAncestors for RDD %d took %f seconds"
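
The manual-worklist pattern in the snippet above can be sketched in isolation. This is a toy illustration, not Spark code: `Node` and `visitAll` are hypothetical stand-ins for `RDD` and the traversal, showing why an explicit buffer avoids `StackOverflowError` on deep lineage chains (depth is bounded by heap, not by the JVM call stack).

```scala
import scala.collection.mutable.{HashSet, ListBuffer}

// Toy graph node; plain class so equality/hashing are by reference,
// which is what the visited-set needs here.
final class Node(val id: Int, val deps: List[Node])

def visitAll(root: Node): List[Int] = {
  val visited = new HashSet[Node]
  val order = new ListBuffer[Int]
  // Explicit worklist instead of recursion, mirroring waitingForVisit
  // in the patch: no recursive call per ancestor, so no deep stack.
  val waitingForVisit = new ListBuffer[Node]
  waitingForVisit += root
  while (waitingForVisit.nonEmpty) {
    val node = waitingForVisit.remove(0)
    if (!visited(node)) {
      visited += node
      order += node.id
      node.deps.foreach(waitingForVisit.prepend(_))
    }
  }
  order.toList
}

// A 100,001-node linear chain that would overflow a naive recursive visit.
val deep = (1 to 100000).foldLeft(new Node(0, Nil))((child, i) => new Node(i, List(child)))
val order = visitAll(deep) // order.head == 100000 (root), order.last == 0
```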

Review comment:
       To clarify, I think we should tackle user-facing logging in a separate PR, since it involves some design decisions that I'd like to consider separately from this fix.
   
   I think the goal of INFO logging would be to help users explain unexpected pauses in their driver code. A slow `.partitions` call is one source of pauses, but those `.partitions` calls often occur before `eagerlyComputePartitionsForRddAndAncestors` is called. As a result, I think that user-facing warning logging should be added [in RDD.partitions itself](https://github.com/apache/spark/blob/d9b4cc65b89d8497dc4e315395f2c98cc4ac9327/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L292) and not here in `eagerlyComputePartitionsForRddAndAncestors`.
   
   Once we've done that, I'm not sure if there's additional value in promoting the `eagerlyComputePartitionsForRddAndAncestors` logging to INFO: I think that logging is primarily useful for Spark's own developers and not for users.
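
The argument above (that the slow work often happens before the eager-compute method runs) hinges on memoization: a cached computation pays its cost at the first call site, wherever that is. The following toy sketch, not Spark's actual `RDD` code, shows why a warning logged inside the getter itself catches the slowness at whichever call site triggers it first; `ToyRdd` and its threshold are hypothetical.

```scala
// Toy stand-in for a lazily computed, memoized partitions field.
class ToyRdd(computePartitions: () => Array[Int]) {
  private var cached: Array[Int] = null

  def partitions: Array[Int] = {
    if (cached == null) {
      // First access does the real work; later accesses hit the cache,
      // so timing done in a later traversal would see ~zero cost.
      val start = System.nanoTime
      cached = computePartitions()
      val elapsedSec = (System.nanoTime - start) / 1e9
      // Warning lives in the getter, so it fires at the slow call site.
      if (elapsedSec > 1.0) {
        println(f"WARN: computing partitions took $elapsedSec%.2f s")
      }
    }
    cached
  }
}

val rdd = new ToyRdd(() => { Thread.sleep(5); Array(1, 2, 3) })
val first = rdd.partitions  // slow path: computes (and warns if slow enough)
val second = rdd.partitions // cached: essentially free
```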





-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


