comphead commented on code in PR #922:
URL: https://github.com/apache/datafusion-comet/pull/922#discussion_r1748909897
##########
docs/source/contributor-guide/plugin_overview.md:
##########
@@ -57,3 +57,50 @@ and this serialized plan is passed into the native code by `CometExecIterator`.
 In the native code there is a `PhysicalPlanner` struct (in `planner.rs`) which converts the serialized
 plan into an Apache DataFusion physical plan. In some cases, Comet provides specialized physical operators and
 expressions to override the DataFusion versions to ensure compatibility with Apache Spark.
+
+## Parquet Support
+
+### Native Parquet Scan with v1 Data Source
+
+When reading from Parquet v1 data sources, Comet provides JVM code that performs the reads from disk and
+implements predicate pushdown to skip row groups, then delegates to native code to decode Parquet pages and
+row groups into Arrow arrays.
+
+`CometScanRule` replaces `FileSourceScanExec` with `CometScanExec`.
+
+`CometScanExec.doExecuteColumnar` creates an instance of `CometParquetPartitionReaderFactory` and passes it
+into either a `DataSourceRDD` (if prefetch is enabled) or a `FileScanRDD`. It then calls `mapPartitionsInternal` on the
+`RDD` and wraps the resulting `Iterator[ColumnarBatch]` in another iterator that collects metrics such as `scanTime`
+and `numOutputRows`.
+
+`CometParquetPartitionReaderFactory` creates an `org.apache.comet.parquet.BatchReader`, which in turn creates one
+column reader per column. There are different column reader implementations for different data types and encodings. The
+column readers invoke methods on the `org.apache.comet.parquet.Native` class such as `resetBatch`, `readBatch`,
+and `currentBatch`.
+
+`CometScanExec` provides batches that will be read by the `ScanExec` native plan leaf node. `CometScanExec` is
+wrapped in a `CometBatchIterator` that converts Spark's `ColumnarBatch` into Arrow arrays. This is then wrapped in
+a `CometExecIterator` that will consume the Arrow arrays and execute the native plan via methods on

Review Comment:
   It's not related to the scan itself, but I would also add how `CometExecIterator` works with the execution plan: `CometExecIterator` iteratively calls `getNextBatch`, which executes the nodes of the execution plan on the native side and returns the memory addresses of the resulting Arrow arrays. Using those memory addresses, the Arrow arrays are then imported as a Spark `ColumnarBatch` to be processed by the Spark JVM.
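To make the metrics wrapping in `doExecuteColumnar` concrete, here is a minimal sketch of the pattern, assuming a hypothetical `MetricsBatchIterator` wrapper. The class name and structure are illustrative, not Comet's actual code; only `scanTime` and `numOutputRows` come from the doc text.

```scala
import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical wrapper illustrating the metrics collection described above;
// not Comet's actual class.
class MetricsBatchIterator(
    underlying: Iterator[ColumnarBatch],
    scanTime: SQLMetric,      // time spent producing batches, in nanoseconds
    numOutputRows: SQLMetric  // total rows handed back to Spark
) extends Iterator[ColumnarBatch] {

  override def hasNext: Boolean = {
    val start = System.nanoTime()
    val result = underlying.hasNext // the underlying scan does its work here
    scanTime.add(System.nanoTime() - start)
    result
  }

  override def next(): ColumnarBatch = {
    val batch = underlying.next()
    numOutputRows.add(batch.numRows())
    batch
  }
}
```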
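Similarly, the per-column decode loop driven through `org.apache.comet.parquet.Native` can be sketched as below. Only the method names `resetBatch`, `readBatch`, and `currentBatch` come from the document; every signature, return type, and the surrounding class are assumptions for illustration.

```scala
// Purely illustrative stand-ins for org.apache.comet.parquet.Native; the
// real JNI signatures are not shown in the doc, so these shapes are assumed.
object NativeStub {
  def resetBatch(handle: Long): Unit = ???                // assumed signature
  def readBatch(handle: Long, batchSize: Int): Int = ???  // assumed: rows decoded
  def currentBatch(handle: Long): Long = ???              // assumed: address of decoded Arrow data
}

// A hypothetical column reader driving one batch of native decoding.
class SketchColumnReader(nativeHandle: Long, batchSize: Int) {
  def readNextBatch(): Option[Long] = {
    NativeStub.resetBatch(nativeHandle) // clear state left over from the previous batch
    val rows = NativeStub.readBatch(nativeHandle, batchSize)
    if (rows == 0) None                 // column exhausted
    else Some(NativeStub.currentBatch(nativeHandle)) // address of the decoded values
  }
}
```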
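Finally, the `getNextBatch` loop described in the review comment can be sketched with the Arrow Java C Data Interface. `Data.importVector`, `ArrowArray.wrap`, and `ArrowSchema.wrap` are real Arrow Java APIs; `NativePlanStub.executePlan` and the iterator class are hypothetical stand-ins for Comet's actual JNI entry points.

```scala
import org.apache.arrow.c.{ArrowArray, ArrowSchema, Data}
import org.apache.arrow.memory.RootAllocator
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

// Hypothetical JNI stand-in: runs the native plan until it produces a batch
// and returns one (arrayAddress, schemaAddress) pair per output column, or
// an empty array when the plan is exhausted. Not Comet's real signature.
object NativePlanStub {
  def executePlan(planHandle: Long): Array[(Long, Long)] = ???
}

// Sketch of the getNextBatch loop: import native Arrow arrays by address
// and surface them to Spark as a ColumnarBatch.
class SketchExecIterator(planHandle: Long) extends Iterator[ColumnarBatch] {
  private val allocator = new RootAllocator()
  private var pending: Option[ColumnarBatch] = None

  private def fetch(): Option[ColumnarBatch] = {
    val addresses = NativePlanStub.executePlan(planHandle)
    if (addresses.isEmpty) None
    else {
      // Import each native array into the JVM via the Arrow C Data Interface.
      val vectors = addresses.map { case (arrayAddr, schemaAddr) =>
        Data.importVector(
          allocator,
          ArrowArray.wrap(arrayAddr),   // wraps the native ArrowArray struct
          ArrowSchema.wrap(schemaAddr), // wraps the native ArrowSchema struct
          null)                         // no dictionary provider in this sketch
      }
      val numRows = vectors.head.getValueCount
      val columns: Array[ColumnVector] = vectors.map(v => new ArrowColumnVector(v): ColumnVector)
      Some(new ColumnarBatch(columns, numRows))
    }
  }

  override def hasNext: Boolean = {
    if (pending.isEmpty) pending = fetch()
    pending.isDefined
  }

  override def next(): ColumnarBatch = {
    if (!hasNext) throw new NoSuchElementException("native plan exhausted")
    val batch = pending.get
    pending = None
    batch
  }
}
```

Passing raw memory addresses across the JNI boundary avoids copying batch data: the C Data Interface import transfers ownership of the underlying Arrow buffers to the JVM-side vectors rather than serializing them.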
