stczwd opened a new pull request #35256: URL: https://github.com/apache/spark/pull/35256
### Why are the changes needed?
Based on [34291](https://github.com/apache/spark/pull/34291), we can support limit push down in the Parquet DataSource V2 reader, which lets the scan stop reading Parquet early and reduces network and disk IO.

Currently, only the vectorized reader is supported in this PR. Limit pushdown for the row-based reader needs to be supported in parquet-hadoop first, so it will be added in a follow-up PR.

Physical plan for a limit query on Parquet, before:
```
== Physical Plan ==
CollectLimit 10
+- *(1) ColumnarToRow
   +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, PushedFilters: [], PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: []
```
After:
```
== Physical Plan ==
CollectLimit 10
+- *(1) ColumnarToRow
   +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, PushedFilters: [], PushedAggregation: [], PushedGroupBy: [], PushedLimit: Some(10) RuntimeFilters: []
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests and new tests.
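The early-stop behavior described above can be illustrated with a small, Spark-independent sketch. This is not the code from this PR: `read_batches`, `row_groups`, `batch_size`, and `pushed_limit` are hypothetical names, and plain Python lists stand in for Parquet row groups and columnar batches. It only shows the general idea that a batch reader which knows the pushed limit can stop producing batches once enough rows have been returned.

```python
from typing import Iterator, List, Optional


def read_batches(
    row_groups: List[List[int]],
    batch_size: int,
    pushed_limit: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield batches of rows, stopping once `pushed_limit` rows were emitted.

    Hypothetical stand-in for a vectorized reader; not Spark's implementation.
    """
    emitted = 0
    for group in row_groups:
        for start in range(0, len(group), batch_size):
            if pushed_limit is not None and emitted >= pushed_limit:
                # Limit satisfied: skip all remaining batches and row groups.
                return
            batch = group[start:start + batch_size]
            if pushed_limit is not None:
                # Trim the final batch so no more than the limit is returned.
                batch = batch[: pushed_limit - emitted]
            emitted += len(batch)
            yield batch


# Hypothetical data: 3 row groups of 100 rows each.
row_groups = [list(range(i * 100, (i + 1) * 100)) for i in range(3)]

# With a pushed limit of 10, only the first batch is ever materialized.
rows = [r for batch in read_batches(row_groups, batch_size=32, pushed_limit=10) for r in batch]
```

Without `pushed_limit`, the same reader walks every row group, which corresponds to the "before" plan where the limit is applied only by `CollectLimit` after the scan.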
