thinkharderdev opened a new issue, #3360:
URL: https://github.com/apache/arrow-datafusion/issues/3360

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   **Describe the solution you'd like**
   
   arrow-rs has recently added the ability to do row-level filtering while 
decoding Parquet files. This can dramatically reduce decoding and IO overhead 
when suitably selective pruning predicates are pushed down to the table 
scan. We should support this in `ParquetExec`.
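   Conceptually (this is a plain-Rust sketch, not the arrow-rs API — `RowGroup` and `filtered_decode` are hypothetical names), the win comes from evaluating the predicate against cheap columns first and only materializing the remaining columns for rows that pass:

   ```rust
   /// Hypothetical row group: one cheap-to-decode column and one expensive one.
   struct RowGroup {
       id: Vec<i64>,
       payload: Vec<String>,
   }

   /// Evaluate the predicate on the cheap column first, then decode the
   /// expensive column only for the rows that passed.
   fn filtered_decode<P: Fn(i64) -> bool>(rg: &RowGroup, pred: P) -> Vec<(i64, String)> {
       // Pass 1: build a row selection from the cheap column alone.
       let selection: Vec<usize> = rg
           .id
           .iter()
           .enumerate()
           .filter(|(_, v)| pred(**v))
           .map(|(i, _)| i)
           .collect();

       // Pass 2: materialize the expensive column only for selected rows.
       selection
           .into_iter()
           .map(|i| (rg.id[i], rg.payload[i].clone()))
           .collect()
   }
   ```

   The real implementation in arrow-rs works on Parquet pages and `RecordBatch`es rather than plain vectors, but the shape of the saving is the same: rows filtered out never pay the cost of decoding the remaining columns.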
   
   **Describe alternatives you've considered**
   
   If a `ParquetExec` has a `PruningPredicate`, it should be "compiled" into a 
vector of `ArrowPredicate`s and supplied as a `RowFilter` to the 
`ParquetRecordBatchStream`. We can implement this in a couple of different ways:
   
   1. Take the pruning `Expr`, create a `PhysicalExpr` from it, and implement a 
single `ArrowPredicateFn` which evaluates it. 
   2. Break apart conjunctions in the pruning `Expr` and compile each to a 
separate `ArrowPredicateFn` that will be applied sequentially. We can either 
take the ordering as given or apply some heuristic to determine the ordering. 
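   Option 2 amounts to splitting the predicate on its top-level `AND`s. A minimal sketch with a hypothetical `Expr` type (DataFusion's real `Expr` is much richer):

   ```rust
   /// Simplified expression tree for illustration only.
   #[derive(Debug, Clone, PartialEq)]
   enum Expr {
       And(Box<Expr>, Box<Expr>),
       Gt(Box<Expr>, Box<Expr>),
       Column(String),
       Literal(i64),
   }

   /// Recursively collect the top-level conjuncts of a predicate; each one
   /// could then be compiled into its own `ArrowPredicateFn`.
   fn split_conjunction(expr: &Expr, out: &mut Vec<Expr>) {
       match expr {
           Expr::And(l, r) => {
               split_conjunction(l, out);
               split_conjunction(r, out);
           }
           other => out.push(other.clone()),
       }
   }
   ```

   Anything that is not a top-level `AND` (e.g. an `OR`) stays as a single conjunct, since splitting it would change the semantics.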
   
   Some considerations:
   
   1. We need to be careful to avoid redundant computation. As filter 
predicates are pushed down in the optimizer, you may end up with a predicate on 
the scan involving the output of a scalar function and a projection in a 
later stage involving the same computation. If you naively execute all filter 
predicates during the scan, you may end up doing the same work twice. 
   2. The filtering is not free: at a minimum you may end up decoding the same 
data multiple times. 
   3. Currently, if users have a custom `TableProvider` they can control what 
gets pushed down. Is that enough configurability? 
   4. Breaking the predicate into multiple filters can help in some situations 
if you get the ordering right (cheap filters first, expensive filters 
after), but if you get it wrong it can hurt performance. This may be amplified 
as we add the ability to parallelize the column decoding (i.e. as the 
morsel-driven scheduler progresses). 
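   For point 4, one possible ordering heuristic (a sketch only — the `FilterCandidate` statistics are assumed, not something the scan has today) is to rank each filter by its estimated cost per row relative to the fraction of rows it eliminates:

   ```rust
   /// Hypothetical per-filter statistics.
   struct FilterCandidate {
       name: String,
       cost_per_row: f64, // estimated evaluation cost per row
       selectivity: f64,  // estimated fraction of rows that pass (0.0..=1.0)
   }

   /// Order filters so that cheap, highly selective filters run first.
   /// Rank = cost / (1 - selectivity): a filter is favored when it is cheap
   /// and when few rows pass it (so later filters see fewer rows).
   fn order_filters(mut filters: Vec<FilterCandidate>) -> Vec<FilterCandidate> {
       let rank = |f: &FilterCandidate| f.cost_per_row / (1.0 - f.selectivity).max(1e-9);
       filters.sort_by(|a, b| rank(a).partial_cmp(&rank(b)).unwrap());
       filters
   }
   ```

   The hard part is not the sort but the estimates: without reasonable cost and selectivity statistics, any fixed ordering can be the wrong one.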
   
   

