[GitHub] [spark] stczwd opened a new pull request #35256: [SPARK-37831][SQL] Limit push down for parquet vectorized reader

GitBox Thu, 20 Jan 2022 00:10:14 -0800


stczwd opened a new pull request #35256:
URL: https://github.com/apache/spark/pull/35256



   ### Why are the changes needed?
   Building on [#34291](https://github.com/apache/spark/pull/34291), this PR pushes 
limit down to the Parquet DataSource V2 reader, so the reader can stop scanning 
Parquet files early and reduce network and disk I/O.
   This PR supports only the vectorized reader. Limit pushdown for the row-based 
reader must first be supported in parquet-hadoop, so it will be added in a 
follow-up PR.
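
   To illustrate why a pushed limit lets a batch reader stop early, here is a minimal sketch in plain Scala. It is a simplified model, not the code from this PR: `BatchSource`, `batchesRead`, and `readWithLimit` are hypothetical names, and the "file" is just a range of row ids served in fixed-size batches.

   ```scala
   // Simplified model of limit pushdown in a batch (vectorized) reader.
   // All names here are hypothetical; this is not the code from the PR.
   object LimitPushDownSketch {
     /** Stands in for a columnar file that yields fixed-size batches of row ids. */
     final class BatchSource(totalRows: Int, batchSize: Int) {
       var batchesRead = 0 // how many batches were actually decoded

       def batches: Iterator[Vector[Int]] =
         (0 until totalRows).grouped(batchSize).map { rows =>
           batchesRead += 1
           rows.toVector
         }
     }

     /** Reads batches until `limit` rows are collected, then stops scanning. */
     def readWithLimit(source: BatchSource, limit: Int): Vector[Int] = {
       val out = Vector.newBuilder[Int]
       var remaining = limit
       val it = source.batches
       while (remaining > 0 && it.hasNext) {
         val batch = it.next().take(remaining) // trim the last batch to the limit
         out ++= batch
         remaining -= batch.size
       }
       out.result()
     }
   }
   ```

   In this model, a source of 100 rows served in batches of 4 decodes only three batches for a limit of 10. The remaining batches are never read.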
   
   Limit pushdown status for Parquet
   Before
   ```
   == Physical Plan ==
   CollectLimit 10
   +- *(1) ColumnarToRow
      +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, PushedFilters: [], PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: []
   ```
   After
   ```
   == Physical Plan ==
   CollectLimit 10
   +- *(1) ColumnarToRow
      +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, PushedFilters: [], PushedAggregation: [], PushedGroupBy: [], PushedLimit: Some(10) RuntimeFilters: []
   ```
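
   The only difference between the two plans is the `PushedLimit: Some(10)` entry. One way a test might check for it is to look for that entry in the plan text. The sketch below is plain Scala and assumes the plan is available as a string in the format shown above. `PushedLimitCheck` and `pushedLimit` are hypothetical names, not part of Spark.

   ```scala
   // Hypothetical helper: extract the pushed limit from a plan string.
   // Assumes the "PushedLimit: Some(n)" format seen in the plan above.
   object PushedLimitCheck {
     private val PushedLimit = """PushedLimit: Some\((\d+)\)""".r

     /** Returns the pushed limit reported in the plan text, if any. */
     def pushedLimit(plan: String): Option[Int] =
       PushedLimit.findFirstMatchIn(plan).map(_.group(1).toInt)
   }
   ```

   Applied to the "After" plan, this would return `Some(10)`. Applied to the "before" plan, it would return `None`.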
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing tests and new tests.




