alamb opened a new issue, #8843: URL: https://github.com/apache/arrow-rs/issues/8843
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

- part of https://github.com/apache/arrow-rs/issues/8000
- Related to #8733 from @hhhizzz
- Related to https://github.com/apache/arrow-rs/issues/5523

TLDR: I want to 1) advance the state of understanding of how late materialization / filter pushdown works, and 2) tell the world how great the Rust implementation is (and implicitly explain what other types of optimizations are unlocked by this).

I think there is significant room to help industrial practitioners by explaining the challenges that go into implementing late materialization "for real" in an industrial strength Parquet reader.

**Background**

The techniques for implementing "late materialization" in column stores are well understood and were first explained well in 2006/2007:

- [Materialization Strategies in a Column-Oriented DBMS](https://www.cs.umd.edu/~abadi/papers/abadiicde2007.pdf)
- [Column-Stores vs. Row-Stores: How Different Are They Really?](https://www.cs.umd.edu/~abadi/papers/abadisigmod06.pdf)

The current Rust Parquet reader supports late materialization (basically the "EM Pipelined" strategy in this diagram from [Materialization Strategies in a Column-Oriented DBMS](https://www.cs.umd.edu/~abadi/papers/abadiicde2007.pdf)):

<img width="367" height="305" alt="Image" src="https://github.com/user-attachments/assets/3cd787c6-483b-4047-a5fc-f80af300ad87" />

Predicates are evaluated during the scan via [ArrowReaderBuilder::with_row_filter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter). See the [RowFilter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html) API for details.
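To make the idea concrete, here is a minimal sketch in plain Rust of what late materialization buys during a scan. It is not the parquet crate's implementation: `EncodedColumn`, `decode_all`, `decode_selected`, `scan_early`, and `scan_late` are hypothetical names, and "decoding" is simulated by counting how many values are touched.

```rust
/// Hypothetical stand-in for an encoded column chunk. Real decoding
/// (decompression, dictionary lookup, ...) is simulated by a counter.
struct EncodedColumn {
    values: Vec<i64>,
}

impl EncodedColumn {
    /// Decode every row and add the number of decoded values to `decoded`.
    fn decode_all(&self, decoded: &mut usize) -> Vec<i64> {
        *decoded += self.values.len();
        self.values.clone()
    }

    /// Decode only the rows whose mask entry is `true`.
    fn decode_selected(&self, mask: &[bool], decoded: &mut usize) -> Vec<i64> {
        let out: Vec<i64> = self
            .values
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect();
        *decoded += out.len();
        out
    }
}

/// Early materialization: decode both columns fully, then filter.
/// Returns the surviving output values and the number of values decoded.
fn scan_early(
    filter_col: &EncodedColumn,
    output_col: &EncodedColumn,
    predicate: impl Fn(i64) -> bool,
) -> (Vec<i64>, usize) {
    let mut decoded = 0;
    let filter_values = filter_col.decode_all(&mut decoded);
    let output_values = output_col.decode_all(&mut decoded);
    let rows: Vec<i64> = filter_values
        .iter()
        .zip(&output_values)
        .filter(|(f, _)| predicate(**f))
        .map(|(_, o)| *o)
        .collect();
    (rows, decoded)
}

/// Late materialization: evaluate the predicate on the filter column first,
/// then decode only the selected rows of the output column.
fn scan_late(
    filter_col: &EncodedColumn,
    output_col: &EncodedColumn,
    predicate: impl Fn(i64) -> bool,
) -> (Vec<i64>, usize) {
    let mut decoded = 0;
    let filter_values = filter_col.decode_all(&mut decoded);
    let mask: Vec<bool> = filter_values.iter().map(|v| predicate(*v)).collect();
    let rows = output_col.decode_selected(&mask, &mut decoded);
    (rows, decoded)
}

fn main() {
    let a = EncodedColumn { values: vec![1, 50, 3, 70, 5, 6] };
    let b = EncodedColumn { values: vec![10, 20, 30, 40, 50, 60] };

    let (early_rows, early_cost) = scan_early(&a, &b, |v| v > 10);
    let (late_rows, late_cost) = scan_late(&a, &b, |v| v > 10);

    // Both strategies return the same rows; only the decoding work differs.
    assert_eq!(early_rows, late_rows);
    println!("rows = {late_rows:?}, early decoded {early_cost}, late decoded {late_cost}");
}
```

In this toy model the saving grows with the selectivity of the predicate and the number of output columns. A real reader cannot skip individual values this cheaply, which is part of why the engineering is harder than the sketch suggests.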
@XiangpengHao also gives a good background treatment in the context of adding a predicate cache (to avoid the overhead of decompressing pages twice): https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/

However, it has taken us several years (and we are still not quite there) to get to the point where we can turn on late materialization "for real", due to various engineering challenges (decompression speed).

Interestingly, there is a similar discussion of filter representation in [Predicate Caching: Query-Driven Secondary Indexing for Cloud Data Warehouses](https://dl.acm.org/doi/10.1145/3626246.3653395), referred to there as `4.1.1 Range Index` and `4.1.2 Bitmap Index`.

**Describe the solution you'd like**

I would like to write a blog post that highlights the tradeoffs in filter representation and how we worked to improve it.
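As a rough illustration of the filter representation tradeoff, the sketch below converts between a boolean bitmap and a run-length list of select/skip ranges. The `Run` enum and the `bitmap_to_runs` / `runs_to_bitmap` helpers are hypothetical names made up for this example; they are not the parquet crate's types.

```rust
/// Hypothetical run-length ("range") representation of a filter result:
/// alternating runs of rows to select and rows to skip.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Collapse a boolean bitmap into runs of identical bits.
fn bitmap_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut iter = mask.iter().copied();
    let Some(mut current) = iter.next() else {
        return runs; // empty bitmap, no runs
    };
    let mut len = 1;
    for bit in iter {
        if bit == current {
            len += 1;
        } else {
            runs.push(if current { Run::Select(len) } else { Run::Skip(len) });
            current = bit;
            len = 1;
        }
    }
    runs.push(if current { Run::Select(len) } else { Run::Skip(len) });
    runs
}

/// Expand runs back into a boolean bitmap.
fn runs_to_bitmap(runs: &[Run]) -> Vec<bool> {
    let mut mask = Vec::new();
    for run in runs {
        match *run {
            Run::Select(n) => mask.extend(std::iter::repeat(true).take(n)),
            Run::Skip(n) => mask.extend(std::iter::repeat(false).take(n)),
        }
    }
    mask
}

fn main() {
    // Clustered filter: one contiguous block of 100 matching rows.
    let mut clustered = vec![false; 1000];
    for bit in &mut clustered[100..200] {
        *bit = true;
    }
    // Scattered filter: every other row matches.
    let scattered: Vec<bool> = (0..1000).map(|i| i % 2 == 0).collect();

    let clustered_runs = bitmap_to_runs(&clustered);
    let scattered_runs = bitmap_to_runs(&scattered);

    // Round-tripping preserves the selection in both cases.
    assert_eq!(runs_to_bitmap(&clustered_runs), clustered);
    assert_eq!(runs_to_bitmap(&scattered_runs), scattered);

    println!("clustered: {} runs", clustered_runs.len());
    println!("scattered: {} runs", scattered_runs.len());
}
```

The bitmap costs the same regardless of the data. The run list is tiny for clustered matches and grows to one entry per row for alternating matches, so neither representation is better in every case.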
