comphead commented on code in PR #922:
URL: https://github.com/apache/datafusion-comet/pull/922#discussion_r1748909897
##########
docs/source/contributor-guide/plugin_overview.md:
##########
@@ -57,3 +57,50 @@ and this serialized plan is passed into the native code by `CometExecIterator`.
 In the native code there is a `PhysicalPlanner` struct (in `planner.rs`) which converts the serialized
 plan into an Apache DataFusion physical plan. In some cases, Comet provides specialized physical operators and
 expressions to override the DataFusion versions to ensure compatibility with Apache Spark.
+
+## Parquet Support
+
+### Native Parquet Scan with v1 Data Source
+
+When reading from Parquet v1 data sources, Comet provides JVM code that performs the reads from disk and
+implements predicate pushdown to skip row groups, then delegates to native code to decode Parquet pages and
+row groups into Arrow arrays.
+
+`CometScanRule` replaces `FileSourceScanExec` with `CometScanExec`.
+
+`CometScanExec.doExecuteColumnar` creates an instance of `CometParquetPartitionReaderFactory` and passes it
+into either a `DataSourceRDD` (if prefetch is enabled) or a `FileScanRDD`. It then calls `mapPartitionsInternal` on the
+`RDD` and wraps the resulting `Iterator[ColumnarBatch]` in another iterator that collects metrics such as `scanTime`
+and `numOutputRows`.
+
+`CometParquetPartitionReaderFactory` creates an `org.apache.comet.parquet.BatchReader`, which in turn creates one
+column reader per column. There are different column reader implementations for different data types and encodings. The
+column readers invoke methods on the `org.apache.comet.parquet.Native` class such as `resetBatch`, `readBatch`,
+and `currentBatch`.
+
+`CometScanExec` provides batches that will be read by the `ScanExec` native plan leaf node. `CometScanExec` is
+wrapped in a `CometBatchIterator` that converts Spark's `ColumnarBatch` into Arrow arrays. This is then wrapped in
+a `CometExecIterator` that will consume the Arrow arrays and execute the native plan via methods on

Review Comment:
   It's not related to the scan itself, but I would also add how `CometExecIterator` works with the execution plan: `CometExecIterator` iteratively calls `getNextBatch`, which executes the nodes of the execution plan on the native side and returns the memory addresses of the resulting Arrow arrays. Using those memory addresses, the Arrow arrays are then imported as a Spark `ColumnarBatch` to be processed by the Spark JVM.
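To make the metrics wrapping in `doExecuteColumnar` concrete, here is a minimal sketch of the pattern, assuming a hypothetical `MetricsBatchIterator` wrapper. The class name and structure are illustrative, not Comet's actual code; only `scanTime` and `numOutputRows` come from the doc text.

```scala
import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical wrapper illustrating the metrics collection described above;
// not Comet's actual class.
class MetricsBatchIterator(
    underlying: Iterator[ColumnarBatch],
    scanTime: SQLMetric,      // time spent producing batches, in nanoseconds
    numOutputRows: SQLMetric  // total rows handed back to Spark
) extends Iterator[ColumnarBatch] {

  override def hasNext: Boolean = {
    val start = System.nanoTime()
    val result = underlying.hasNext // the underlying scan does its work here
    scanTime.add(System.nanoTime() - start)
    result
  }

  override def next(): ColumnarBatch = {
    val batch = underlying.next()
    numOutputRows.add(batch.numRows())
    batch
  }
}
```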
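Similarly, the per-column decode loop driven through `org.apache.comet.parquet.Native` can be sketched as below. Only the method names `resetBatch`, `readBatch`, and `currentBatch` come from the document; every signature, return type, and the surrounding class are assumptions for illustration.

```scala
// Purely illustrative stand-ins for org.apache.comet.parquet.Native; the
// real JNI signatures are not shown in the doc, so these shapes are assumed.
object NativeStub {
  def resetBatch(handle: Long): Unit = ???                // assumed signature
  def readBatch(handle: Long, batchSize: Int): Int = ???  // assumed: rows decoded
  def currentBatch(handle: Long): Long = ???              // assumed: address of decoded Arrow data
}

// A hypothetical column reader driving one batch of native decoding.
class SketchColumnReader(nativeHandle: Long, batchSize: Int) {
  def readNextBatch(): Option[Long] = {
    NativeStub.resetBatch(nativeHandle) // clear state left over from the previous batch
    val rows = NativeStub.readBatch(nativeHandle, batchSize)
    if (rows == 0) None                 // column exhausted
    else Some(NativeStub.currentBatch(nativeHandle)) // address of the decoded values
  }
}
```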
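Finally, the `getNextBatch` loop described in the review comment can be sketched with the Arrow Java C Data Interface. `Data.importVector`, `ArrowArray.wrap`, and `ArrowSchema.wrap` are real Arrow Java APIs; `NativePlanStub.executePlan` and the iterator class are hypothetical stand-ins for Comet's actual JNI entry points.

```scala
import org.apache.arrow.c.{ArrowArray, ArrowSchema, Data}
import org.apache.arrow.memory.RootAllocator
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

// Hypothetical JNI stand-in: runs the native plan until it produces a batch
// and returns one (arrayAddress, schemaAddress) pair per output column, or
// an empty array when the plan is exhausted. Not Comet's real signature.
object NativePlanStub {
  def executePlan(planHandle: Long): Array[(Long, Long)] = ???
}

// Sketch of the getNextBatch loop: import native Arrow arrays by address
// and surface them to Spark as a ColumnarBatch.
class SketchExecIterator(planHandle: Long) extends Iterator[ColumnarBatch] {
  private val allocator = new RootAllocator()
  private var pending: Option[ColumnarBatch] = None

  private def fetch(): Option[ColumnarBatch] = {
    val addresses = NativePlanStub.executePlan(planHandle)
    if (addresses.isEmpty) None
    else {
      // Import each native array into the JVM via the Arrow C Data Interface.
      val vectors = addresses.map { case (arrayAddr, schemaAddr) =>
        Data.importVector(
          allocator,
          ArrowArray.wrap(arrayAddr),   // wraps the native ArrowArray struct
          ArrowSchema.wrap(schemaAddr), // wraps the native ArrowSchema struct
          null)                         // no dictionary provider in this sketch
      }
      val numRows = vectors.head.getValueCount
      val columns: Array[ColumnVector] = vectors.map(v => new ArrowColumnVector(v): ColumnVector)
      Some(new ColumnarBatch(columns, numRows))
    }
  }

  override def hasNext: Boolean = {
    if (pending.isEmpty) pending = fetch()
    pending.isDefined
  }

  override def next(): ColumnarBatch = {
    if (!hasNext) throw new NoSuchElementException("native plan exhausted")
    val batch = pending.get
    pending = None
    batch
  }
}
```

Passing raw memory addresses across the JNI boundary avoids copying batch data: the C Data Interface import transfers ownership of the underlying Arrow buffers to the JVM-side vectors rather than serializing them.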
