mbutrovich commented on code in PR #2168:
URL: https://github.com/apache/datafusion-comet/pull/2168#discussion_r2282755635

##########
spark/src/main/java/org/apache/comet/CometBatchIterator.java:
##########
@@ -26,13 +26,45 @@ import org.apache.comet.vector.NativeUtil;
 /**
- * An iterator that can be used to get batches of Arrow arrays from a Spark iterator of
- * ColumnarBatch. It will consume input iterator and return Arrow arrays by addresses. This is
- * called by native code to retrieve Arrow arrays from Spark through JNI.
+ * A Java adapter iterator that provides batch-by-batch Arrow array access for native code
+ * consumption. This class serves as a bridge between Spark's ColumnarBatch format and native
+ * DataFusion execution.
+ *
+ * <h2>Architecture Role</h2>
+ *
+ * CometBatchIterator acts as a pull-based data source for native execution:
+ *
+ * <ul>
+ * <li>Wraps Spark ColumnarBatch iterators from upstream operators
+ * <li>Exports Arrow arrays to native code via memory addresses using Arrow's C Data Interface
+ * <li>Provides JNI-friendly API for native code consumption
+ * </ul>
+ *
+ * <h2>Memory Ownership Model</h2>
+ *
+ * Batches are owned by the JVM. Native code can safely access the batch after calling `next` but
+ * the native code must not retain references to the batch because the next call to `hasNext` will
+ * signal to the JVM that the batch can be closed.
+ *
+ * <pre>
+ * JVM Phase: CometBatchIterator owns ColumnarBatch
+ * ↓
+ * Export Phase: Arrays exported via memory addresses
+ * ↓
+ * Native Phase: Native code processes Arrow arrays
+ * ↓
+ * Release Phase: CometBatchIterator releases reference to batch
+ * </pre>
+ *
+ * <h2>Thread Safety</h2>
+ *
+ * This class is <strong>NOT thread-safe</strong>. It's designed for single-threaded access from
+ * native code via JNI. Concurrent access will cause race conditions and memory corruption.
  */
 public class CometBatchIterator {
-  final Iterator<ColumnarBatch> input;
-  final NativeUtil nativeUtil;
+  private final Iterator<ColumnarBatch> input;
+  private final NativeUtil nativeUtil;
+  private ColumnarBatch previousBatch = null;

Review Comment:
   Does this solve the problem where shuffle writer buffers multiple batches, or just buy us safety for one extra batch?
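
To make the question concrete, here is a minimal sketch of what a one-slot deferral would protect. It assumes that `previousBatch` is closed on the `hasNext` call after the following batch has been fetched; the hunk above does not show how the field is actually used, so that behaviour is an assumption. `FakeBatch`, `OneSlotDeferredIterator` and `DeferralDemo` are hypothetical stand-ins, not Comet classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for ColumnarBatch: it only records whether close() was called. */
final class FakeBatch implements AutoCloseable {
  final int id;
  private boolean closed = false;

  FakeBatch(int id) {
    this.id = id;
  }

  boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    closed = true;
  }
}

/** Hypothetical iterator that keeps exactly one already-returned batch alive (assumed design). */
final class OneSlotDeferredIterator {
  private final Iterator<FakeBatch> input;
  private FakeBatch currentBatch = null;
  private FakeBatch previousBatch = null;

  OneSlotDeferredIterator(Iterator<FakeBatch> input) {
    this.input = input;
  }

  boolean hasNext() {
    // Rotate only if a batch was handed out since the last call, so repeated
    // hasNext() calls do not close anything early.
    if (currentBatch != null) {
      if (previousBatch != null) {
        previousBatch.close(); // the batch from two fetches ago is released here
      }
      previousBatch = currentBatch;
      currentBatch = null;
    }
    return input.hasNext();
  }

  FakeBatch next() {
    currentBatch = input.next();
    return currentBatch;
  }
}

final class DeferralDemo {
  /** Fetches fetchCount batches while retaining all of them, then reports which are closed. */
  static List<Boolean> closedFlagsAfterFetching(int fetchCount, int totalBatches) {
    List<FakeBatch> source = new ArrayList<>();
    for (int i = 0; i < totalBatches; i++) {
      source.add(new FakeBatch(i));
    }
    OneSlotDeferredIterator it = new OneSlotDeferredIterator(source.iterator());

    // A consumer that buffers references, as a shuffle writer might.
    List<FakeBatch> buffered = new ArrayList<>();
    for (int i = 0; i < fetchCount && it.hasNext(); i++) {
      buffered.add(it.next());
    }

    List<Boolean> flags = new ArrayList<>();
    for (FakeBatch batch : buffered) {
      flags.add(batch.isClosed());
    }
    return flags;
  }

  public static void main(String[] args) {
    System.out.println(closedFlagsAfterFetching(3, 5)); // prints [true, false, false]
  }
}
```

Under that assumption, a consumer holding two batches is safe, but a consumer holding three already sees the first one closed, which is the "one extra batch" case.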