Re: [PR] perf: [EXPERIMENTAL] cache and broadcast serialized plans across partitions [datafusion-comet]

via GitHub Thu, 22 Jan 2026 12:13:25 -0800


andygrove commented on code in PR #3244:
URL: https://github.com/apache/datafusion-comet/pull/3244#discussion_r2718434424



##########
spark/src/main/scala/org/apache/spark/sql/comet/CometTakeOrderedAndProjectExec.scala:
##########
@@ -133,12 +133,20 @@ case class CometTakeOrderedAndProjectExec(
           CometExecUtils.getNativeLimitRDD(childRDD, child.output, limit)
         } else {
           val numParts = childRDD.getNumPartitions
+          val numOutputCols = child.output.length
+          // Serialize the plan once and broadcast to avoid repeated serialization
+          val serializedTopK = CometExec.serializePlan(
+            CometExecUtils
+              .getTopKNativePlan(child.output, sortOrder, child, limit)
+              .get)
+          val broadcastTopK = sparkContext.broadcast(serializedTopK)

Review Comment:
   I think we can rely on Spark to GC the broadcast variable once it is no longer referenced?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


