alamb opened a new issue, #8842: URL: https://github.com/apache/arrow-rs/issues/8842
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** The idea of [executing directly on "compressed" (encoded) data](https://www.cs.umd.edu/~abadi/papers/abadisigmod06.pdf) is a well known technique in the academic columnar store literature. The Parquet format currently supports several different [encodings](https://github.com/apache/parquet-format/blob/master/Encodings.md) such as: - Plain - Dictionary (and hybrid) - RLE/Bitpacked - DeltaBinaryPacked - DeltaByteArray - ByteStreamSplit The current Rust Parquet reader supports evaluating predicates during the scan via the [ArrowReaderBuilder::with_row_filter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter). However, the current [RowFilter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html) API only permits evaluation on Arrow arrays, aka **after** decompression and decoding. It would be interesting to consider supporting evaluating predicates directly on parquet encoded data, to avoid decode costs when unnecessary -- for example when doing "needle in the haystack" type queries that filter out almost all rows As the community considers [adding more encodings], such as ALP and FSST that are even more amenable to encoded operation, the benefit of such an API will increase. **Describe the solution you'd like** I would like some way to evaluate predicates directly on the encoded Parquet data **Describe alternatives you've considered** One potential thing we could do is extend the existing RowFilter API so different implementations could be provided for different encodings ```rust // add predicates evaluated on (decoded) Arrow arrays let filter = RowFilter::new(arrow_predicates) // add specializations that can evaluate directly on RLE encoded data // if the data is not RLE encoded, fall back to the arrow ones .with_rle_predicates(...) ``` We would have to do some more experiments to know what the API for RLE (and similar) predicates should be **Additional context** The Vortex file implementation seems to be heading towards adding their own Expression implementation for common expressions and then providing the For example, see how VortexExpr::evaluate looks (it has all the knowledge of the types, etc) https://docs.rs/vortex-expr/0.54.0/vortex_expr/trait.VortexExpr.html#method.evaluate -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
