thinkharderdev opened a new issue, #3360: URL: https://github.com/apache/arrow-datafusion/issues/3360
**Describe the solution you'd like**

arrow-rs has recently added the ability to do row-level filtering while decoding Parquet files. This can dramatically reduce decoding and IO overhead when suitably selective pruning predicates are pushed down to the table scan. We should support this in `ParquetExec`.

**Describe alternatives you've considered**

If a `ParquetExec` has a `PruningPredicate`, it should be "compiled" to a vector of `ArrowPredicate`s and supplied as a `RowFilter` to the `ParquetRecordBatchStream`. We could implement this in a couple of different ways:

1. Take the pruning `Expr`, create a `PhysicalExpr`, and implement a single `ArrowPredicateFn` which evaluates it.
2. Break apart conjunctions in the pruning `Expr` and compile each one to a separate `ArrowPredicateFn` that will be applied sequentially. We can either take the ordering as given or apply some heuristic to determine the ordering.

Some considerations:

1. We need to be careful to avoid redundant computation. As filter predicates are pushed down in the optimizer, you may end up with a predicate on the scan involving the output of a scalar function and also a projection in a later stage involving the same computation. If you naively execute all filter predicates during the scan, you may end up doing the same work twice.
2. The filtering is not free. At a minimum, you may end up decoding the same data multiple times.
3. Currently, if users have a custom `TableProvider` they can control what gets pushed down. Is that enough configurability?
4. Breaking the predicate into multiple filters can help in some situations if you get the ordering correct (cheap filters first, expensive filters after), but if you get it wrong then it can be bad. This may be amplified as we add the ability to parallelize the column decoding (i.e. as the morsel-driven scheduler progresses).
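The ordering question in consideration 4 can be sketched independently of the parquet API. Below is a minimal, self-contained illustration of a "cheap and selective filters first" heuristic; `FilterCandidate`, `order_filters`, and the cost/selectivity numbers are all hypothetical placeholders, not DataFusion or arrow-rs APIs (a real implementation would wrap compiled `PhysicalExpr`s and derive estimates from statistics):

```rust
/// Hypothetical description of one conjunct of the pushed-down predicate.
#[derive(Debug)]
struct FilterCandidate {
    /// Display form of the predicate, for illustration only.
    name: &'static str,
    /// Estimated cost of evaluating the predicate per row (e.g. a scalar
    /// function call is pricier than a simple column comparison).
    cost_per_row: u32,
    /// Estimated fraction of rows the predicate keeps, in percent.
    selectivity_pct: u32,
}

/// Order candidates so that cheap, highly selective filters run first:
/// they shrink the row set before the expensive predicates ever see it.
fn order_filters(mut candidates: Vec<FilterCandidate>) -> Vec<FilterCandidate> {
    candidates.sort_by_key(|c| (c.cost_per_row, c.selectivity_pct));
    candidates
}

fn main() {
    let ordered = order_filters(vec![
        FilterCandidate { name: "regexp_match(s, ...)", cost_per_row: 100, selectivity_pct: 50 },
        FilterCandidate { name: "a > 5", cost_per_row: 1, selectivity_pct: 10 },
        FilterCandidate { name: "b = 'x'", cost_per_row: 2, selectivity_pct: 30 },
    ]);
    for c in &ordered {
        println!("{}", c.name);
    }
}
```

With a reasonable cost model this runs the comparison predicates before the regex, which is exactly the win described above; with a bad model it would do the opposite, which is why the ordering risk in consideration 4 matters.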
