Re: [PR] feat: [iceberg] CometExecRDD supports per-partition plan data, Iceberg native scan with DPP [datafusion-comet]

via GitHub Wed, 04 Feb 2026 23:32:48 -0800


peter-toth commented on code in PR #3349:
URL: https://github.com/apache/datafusion-comet/pull/3349#discussion_r2767527439



##########
spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala:
##########
@@ -19,39 +19,204 @@
 
 package org.apache.spark.sql.comet
 
-import org.apache.spark.{Partition, SparkContext, TaskContext}
-import org.apache.spark.rdd.{RDD, RDDOperationScope}
+import org.apache.spark._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.execution.ScalarSubquery
 import org.apache.spark.sql.vectorized.ColumnarBatch
+import org.apache.spark.util.SerializableConfiguration
+
+import org.apache.comet.CometExecIterator
+import org.apache.comet.serde.OperatorOuterClass
+
+/**
+ * Partition that carries per-partition planning data, avoiding closure capture of all partitions.
+ */
+private[spark] class CometExecPartition(
+    override val index: Int,
+    val inputPartitions: Array[Partition],
+    val planDataByKey: Map[String, Array[Byte]])
+    extends Partition
 
 /**
- * A RDD that executes Spark SQL query in Comet native execution to generate ColumnarBatch.
+ * Unified RDD for Comet native execution.
+ *
+ * Solves the closure capture problem: instead of capturing all partitions' data in the closure
+ * (which gets serialized to every task), each Partition object carries only its own data.
+ *
+ * Handles three cases:
+ *   - With inputs + per-partition data: injects planning data into operator tree
+ *   - With inputs + no per-partition data: just zips inputs (no injection overhead)
+ *   - No inputs: uses numPartitions to create partitions
+ *
+ * NOTE: This RDD does not handle DPP (InSubqueryExec), which is resolved in
+ * CometIcebergNativeScanExec.serializedPartitionData before this RDD is created. It also handles
+ * ScalarSubquery expressions by registering them with CometScalarSubquery before execution.
  */
 private[spark] class CometExecRDD(

Review Comment:
   Can you please implement `clearDependencies()`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


