yjshen commented on pull request #1990: URL: https://github.com/apache/arrow-datafusion/pull/1990#issuecomment-1065864478
Hi @tustvold, the filter is based on the row-group midpoint position. It was recently introduced in the parquet crate with https://github.com/apache/arrow-rs/commit/2bca71e322fcab6c6d93a47ef71638a617e29f6c. The midpoint filtering is modeled after [ParquetInputSplit](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L67-L91) and [ParquetMetadataConverter](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1241-L1292).

Row-group-level parallelism for Parquet is used in MapReduce and Spark. In Spark, [`splitFiles`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala#L26-L45) generates task partitions based on the partition size settings, and it may split a larger Parquet file across multiple partitions.

Currently this PR is still a WIP, since only the physical plan changes are implemented. We translate the Spark physical plan into a DataFusion physical plan to run natively in DataFusion: https://github.com/blaze-init/spark-blaze-extension/blob/master/src/main/scala/org/apache/spark/sql/blaze/plan/NativeParquetScanExec.scala#L57-L63
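To illustrate the midpoint rule: a row group is assigned to a file split if the byte offset of its middle falls inside the split's byte range, so each row group belongs to exactly one split even when split boundaries cut through it. A minimal sketch (the struct and function names here are hypothetical stand-ins, not the actual parquet crate API):

```rust
/// Hypothetical: byte range [start, end) of one file split handed to a task.
struct SplitRange {
    start: u64,
    end: u64,
}

/// Hypothetical: simplified view of a row group's position in the file.
struct RowGroupInfo {
    file_offset: u64,
    total_byte_size: u64,
}

/// A row group is selected for a split when its midpoint lies in [start, end).
fn in_split(rg: &RowGroupInfo, split: &SplitRange) -> bool {
    let mid = rg.file_offset + rg.total_byte_size / 2;
    split.start <= mid && mid < split.end
}

fn main() {
    let groups = vec![
        RowGroupInfo { file_offset: 0, total_byte_size: 100 },   // midpoint 50
        RowGroupInfo { file_offset: 100, total_byte_size: 100 }, // midpoint 150
        RowGroupInfo { file_offset: 200, total_byte_size: 100 }, // midpoint 250
    ];
    // A split covering bytes [0, 200) picks up the first two row groups,
    // even though the second one extends past the split boundary.
    let split = SplitRange { start: 0, end: 200 };
    let selected = groups.iter().filter(|rg| in_split(rg, &split)).count();
    println!("{}", selected); // prints 2
}
```

Because every byte of the file is covered by exactly one split and every midpoint is a single byte, no row group is read twice and none is dropped, which is what makes this scheme safe for splitting one Parquet file across Spark task partitions.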
