Re: [PR] chore: Improve documentation for `CometBatchIterator` and fix a potential issue [datafusion-comet]

via GitHub Mon, 18 Aug 2025 08:32:47 -0700


mbutrovich commented on code in PR #2168:
URL: https://github.com/apache/datafusion-comet/pull/2168#discussion_r2282755635



##########
spark/src/main/java/org/apache/comet/CometBatchIterator.java:
##########
@@ -26,13 +26,45 @@
 import org.apache.comet.vector.NativeUtil;
 
 /**
- * An iterator that can be used to get batches of Arrow arrays from a Spark iterator of
- * ColumnarBatch. It will consume input iterator and return Arrow arrays by addresses. This is
- * called by native code to retrieve Arrow arrays from Spark through JNI.
+ * A Java adapter iterator that provides batch-by-batch Arrow array access for native code
+ * consumption. This class serves as a bridge between Spark's ColumnarBatch format and native
+ * DataFusion execution.
+ *
+ * <h2>Architecture Role</h2>
+ *
+ * CometBatchIterator acts as a pull-based data source for native execution:
+ *
+ * <ul>
+ *   <li>Wraps Spark ColumnarBatch iterators from upstream operators
+ *   <li>Exports Arrow arrays to native code via memory addresses using Arrow's C Data Interface
+ *   <li>Provides JNI-friendly API for native code consumption
+ * </ul>
+ *
+ * <h2>Memory Ownership Model</h2>
+ *
+ * Batches are owned by the JVM. Native code can safely access the batch after calling `next` but
+ * the native code must not retain references to the batch because the next call to `hasNext` will
+ * signal to the JVM that the batch can be closed.
+ *
+ * <pre>
+ * JVM Phase:     CometBatchIterator owns ColumnarBatch
+ *                        ↓
+ * Export Phase:  Arrays exported via memory addresses
+ *                        ↓
+ * Native Phase:  Native code processes Arrow arrays
+ *                        ↓
+ * Release Phase: CometBatchIterator releases reference to batch
+ * </pre>
+ *
+ * <h2>Thread Safety</h2>
+ *
+ * This class is <strong>NOT thread-safe</strong>. It's designed for single-threaded access from
+ * native code via JNI. Concurrent access will cause race conditions and memory corruption.
  */
 public class CometBatchIterator {
-  final Iterator<ColumnarBatch> input;
-  final NativeUtil nativeUtil;
+  private final Iterator<ColumnarBatch> input;
+  private final NativeUtil nativeUtil;
+  private ColumnarBatch previousBatch = null;

Review Comment:
   Does this solve the problem where the shuffle writer buffers multiple batches, or does it just buy us safety for one extra batch?
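
   To make the concern concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `FakeBatch`, `DeferredCloseIterator`, and `BufferingSketch` are hypothetical stand-ins, and the sketch assumes, based on the Javadoc above, that `hasNext` closes the batch returned by the previous `next`. Under that assumption, a consumer that buffers N batches before flushing would find all but the most recent one already closed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for ColumnarBatch: it only records whether close() was called.
final class FakeBatch {
  final int id;
  boolean closed = false;

  FakeBatch(int id) {
    this.id = id;
  }

  void close() {
    closed = true;
  }
}

// Hypothetical iterator (an assumption, not the PR's implementation): the batch
// handed out by next() stays open until the following hasNext() call closes it.
final class DeferredCloseIterator {
  private final Iterator<FakeBatch> input;
  private FakeBatch previousBatch = null;

  DeferredCloseIterator(Iterator<FakeBatch> input) {
    this.input = input;
  }

  boolean hasNext() {
    if (previousBatch != null) {
      previousBatch.close(); // the consumer is assumed to be done with it
      previousBatch = null;
    }
    return input.hasNext();
  }

  FakeBatch next() {
    previousBatch = input.next();
    return previousBatch;
  }
}

final class BufferingSketch {
  // Simulates a consumer that buffers `bufferSize` batches before flushing and
  // returns how many buffered batches were already closed at flush time.
  static int closedAtFlush(int bufferSize) {
    List<FakeBatch> source = new ArrayList<>();
    for (int i = 0; i < bufferSize + 1; i++) {
      source.add(new FakeBatch(i));
    }
    DeferredCloseIterator it = new DeferredCloseIterator(source.iterator());

    List<FakeBatch> buffered = new ArrayList<>();
    for (int i = 0; i < bufferSize && it.hasNext(); i++) {
      buffered.add(it.next());
    }

    int closed = 0;
    for (FakeBatch batch : buffered) {
      if (batch.closed) {
        closed++;
      }
    }
    return closed;
  }

  public static void main(String[] args) {
    for (int size = 1; size <= 3; size++) {
      System.out.println("buffer=" + size + " closedAtFlush=" + closedAtFlush(size));
    }
  }
}
```

   In this model only the most recently returned batch is still open at flush time, so a buffer of one batch is safe but larger buffers are not unless the consumer copies the data it retains.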


