yjshen commented on pull request #1990: URL: https://github.com/apache/arrow-datafusion/pull/1990#issuecomment-1065864478
Hi @tustvold, the filter is based on the row-group midpoint position. It was recently introduced in the parquet crate with https://github.com/apache/arrow-rs/commit/2bca71e322fcab6c6d93a47ef71638a617e29f6c. The midpoint filtering is modeled after [ParquetInputSplit](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L67-L91) and [ParquetMetadataConverter](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1241-L1292).

Row-group-level parallelism for Parquet is used in MapReduce and Spark. In Spark, [`splitFiles`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala#L26-L45) generates task partitions based on the partition size settings, and it may split a larger Parquet file across multiple partitions.

Currently this PR is still a WIP, since only the physical plan changes are implemented. We translate the Spark physical plan into a DataFusion physical plan to run natively in DataFusion: https://github.com/blaze-init/spark-blaze-extension/blob/master/src/main/scala/org/apache/spark/sql/blaze/plan/NativeParquetScanExec.scala#L57-L63
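To illustrate the midpoint rule: a row group is assigned to a file split if the byte offset of its middle falls inside the split's byte range, so each row group belongs to exactly one split even when split boundaries cut through it. A minimal sketch (the struct and function names here are hypothetical stand-ins, not the actual parquet crate API):

```rust
/// Hypothetical: byte range [start, end) of one file split handed to a task.
struct SplitRange {
    start: u64,
    end: u64,
}

/// Hypothetical: simplified view of a row group's position in the file.
struct RowGroupInfo {
    file_offset: u64,
    total_byte_size: u64,
}

/// A row group is selected for a split when its midpoint lies in [start, end).
fn in_split(rg: &RowGroupInfo, split: &SplitRange) -> bool {
    let mid = rg.file_offset + rg.total_byte_size / 2;
    split.start <= mid && mid < split.end
}

fn main() {
    let groups = vec![
        RowGroupInfo { file_offset: 0, total_byte_size: 100 },   // midpoint 50
        RowGroupInfo { file_offset: 100, total_byte_size: 100 }, // midpoint 150
        RowGroupInfo { file_offset: 200, total_byte_size: 100 }, // midpoint 250
    ];
    // A split covering bytes [0, 200) picks up the first two row groups,
    // even though the second one extends past the split boundary.
    let split = SplitRange { start: 0, end: 200 };
    let selected = groups.iter().filter(|rg| in_split(rg, &split)).count();
    println!("{}", selected); // prints 2
}
```

Because every byte of the file is covered by exactly one split and every midpoint is a single byte, no row group is read twice and none is dropped, which is what makes this scheme safe for splitting one Parquet file across Spark task partitions.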
