alamb opened a new issue #363:
URL: https://github.com/apache/arrow-datafusion/issues/363


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   DataFusion contains logic (originally contributed by @yordan-pavlov in 
https://github.com/apache/arrow/pull/9064 🎉 ) to perform Row Group Pruning, 
which skips scanning of entire row groups within a parquet file, based on 
pushed down predicates (source link in arrow-datafusion: 
[parquet.rs](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/parquet.rs)).
   
   The algorithm behind the Row Group Pruning implementation is general and can 
be applied to any storage system that maintains min/max statistics for 
different sets of files / chunks of the data and would like to quickly rule out 
chunks which cannot match a predicate.
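
As a minimal sketch of the idea (not the actual DataFusion code), the check for a single integer column can be written as below. The names `ChunkStats`, `Predicate` and `might_match` are hypothetical and exist only for this illustration.

```rust
/// Min/max statistics for one column in one chunk (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct ChunkStats {
    min: i64,
    max: i64,
}

/// A simple single-column predicate (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns true if the chunk *might* contain a matching row.
/// Returning false means no row can match, so the chunk can be skipped.
fn might_match(stats: ChunkStats, pred: Predicate) -> bool {
    match pred {
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        Predicate::Gt(v) => stats.max > v,
        Predicate::Lt(v) => stats.min < v,
    }
}

fn main() {
    let chunks = [
        ChunkStats { min: 0, max: 9 },
        ChunkStats { min: 10, max: 19 },
        ChunkStats { min: 20, max: 29 },
    ];
    for pred in [Predicate::Eq(12), Predicate::Gt(15), Predicate::Lt(10)] {
        let keep: Vec<bool> = chunks.iter().map(|s| might_match(*s, pred)).collect();
        println!("{:?} -> {:?}", pred, keep);
    }
}
```

The check is conservative: `true` only means the chunk has to be scanned, not that it contains a match.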
   
   We would like to reuse the row group pruning logic from DataFusion rather 
than writing our own. Making this logic easier to reuse would help both other 
parts of DataFusion (e.g. pruning parquet *files* rather than just row groups) 
and downstream projects. We also hope to benefit ourselves as the community 
works to improve this code.
   
   In addition, there are other use cases, such as the one mentioned by 
@returnString, where you have a bunch of parquet files in some object store 
along with statistics about their min/max values, and you could skip entire 
files based on those statistics alone.
   
   **Describe the solution you'd like**
   1. Refactor what is currently called `RowGroupPredicateBuilder` into 
something more generic related to `Pruning`
   2. Rework the implementation so it is generic over a `Statistics` trait, so 
that the predicates can be evaluated against any type (not just the Parquet 
`RowGroupMetadata`)
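
One possible shape for such a trait is sketched below. This is an assumption made for discussion, not an existing DataFusion API: `PruningStatistics`, `EqPredicate`, `prune`, `FileStats` and the `file` helper are all hypothetical names, and values are limited to `i64` to keep the example short.

```rust
use std::collections::HashMap;

/// Hypothetical trait: anything that can report per-column min/max values.
trait PruningStatistics {
    /// Minimum value of `column`, or None if unknown.
    fn min_value(&self, column: &str) -> Option<i64>;
    /// Maximum value of `column`, or None if unknown.
    fn max_value(&self, column: &str) -> Option<i64>;
}

/// A `column = value` predicate (hypothetical).
struct EqPredicate<'a> {
    column: &'a str,
    value: i64,
}

/// For each container: true = must be scanned, false = can be skipped.
/// Containers with missing statistics are conservatively kept.
fn prune<S: PruningStatistics>(pred: &EqPredicate, containers: &[S]) -> Vec<bool> {
    containers
        .iter()
        .map(|c| match (c.min_value(pred.column), c.max_value(pred.column)) {
            (Some(min), Some(max)) => min <= pred.value && pred.value <= max,
            _ => true,
        })
        .collect()
}

/// Stand-in for per-file statistics kept outside the parquet files.
struct FileStats {
    columns: HashMap<String, (i64, i64)>,
}

impl PruningStatistics for FileStats {
    fn min_value(&self, column: &str) -> Option<i64> {
        self.columns.get(column).map(|(min, _)| *min)
    }
    fn max_value(&self, column: &str) -> Option<i64> {
        self.columns.get(column).map(|(_, max)| *max)
    }
}

/// Helper that builds statistics for a file with a single known column.
fn file(column: &str, min: i64, max: i64) -> FileStats {
    let mut columns = HashMap::new();
    columns.insert(column.to_string(), (min, max));
    FileStats { columns }
}

fn main() {
    let files = vec![file("ts", 0, 99), file("ts", 100, 199), file("other", 0, 1)];
    let pred = EqPredicate { column: "ts", value: 150 };
    // The third file has no statistics for `ts`, so it is kept.
    println!("{:?}", prune(&pred, &files));
}
```

With a trait along these lines, the same `prune` function could be applied to parquet row groups, whole files, or any other container that implements it.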
   
   **Additional context**
   
   You can see more about the use case on the IOx ticket 
https://github.com/influxdata/influxdb_iox/issues/736 and in the [design 
document](https://docs.google.com/document/d/1ulK-jHxYEMTDQT77u0GzCMGRFC5MYQcegztkcnPHYnM/edit#)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

