alamb opened a new issue, #8843:
URL: https://github.com/apache/arrow-rs/issues/8843

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   - part of https://github.com/apache/arrow-rs/issues/8000
   - Related to #8733 from @hhhizzz
   - Related to https://github.com/apache/arrow-rs/issues/5523 
   
   TL;DR: I want to 1) advance the state of understanding of how late 
materialization / filter pushdown works, and 2) tell the world how good the 
Rust implementation is (and implicitly explain what other kinds of 
optimizations it unlocks)
   
   I think there is significant room to help industrial practitioners by 
explaining the challenges of implementing late materialization "for 
real" in an industrial-strength Parquet reader
   
   **Background**
   
   The techniques for implementing "late materialization" in column stores are 
well understood, and were first explained well in 2006/2007:
   - [Materialization Strategies in a Column-Oriented 
DBMS](https://www.cs.umd.edu/~abadi/papers/abadiicde2007.pdf)
   - [Column-Stores vs. Row-Stores: How Different Are They 
Really?](https://www.cs.umd.edu/~abadi/papers/abadisigmod06.pdf)
   
   The current Rust Parquet reader supports late materialization (basically the 
"EM Pipelined" strategy in this diagram from [Materialization Strategies in a 
Column-Oriented DBMS](https://www.cs.umd.edu/~abadi/papers/abadiicde2007.pdf)):
   
   <img width="367" height="305" alt="Image" 
src="https://github.com/user-attachments/assets/3cd787c6-483b-4047-a5fc-f80af300ad87"
 />
   
   Predicates are evaluated during the scan via the 
[ArrowReaderBuilder::with_row_filter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter)
 API. See the 
[RowFilter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html)
 documentation for details.
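   To make the mechanics concrete, here is a toy, self-contained sketch of the late-materialization idea. This is *not* the real `parquet` crate API (the actual entry point is `with_row_filter`, linked above, which operates on encoded pages and Arrow `RecordBatch`es); all function names here are illustrative. The key point is the ordering: the filter column is decoded first, the predicate yields a row selection, and only the selected rows of the remaining columns are ever materialized.

   ```rust
   // Toy late-materialization sketch. A real reader works on encoded,
   // compressed pages; here each "column" is just a Vec for illustration.

   /// Evaluate a predicate over the filter column only, producing a
   /// boolean selection (one entry per row).
   fn evaluate_predicate(filter_col: &[i64], pred: impl Fn(i64) -> bool) -> Vec<bool> {
       filter_col.iter().map(|&v| pred(v)).collect()
   }

   /// Materialize only the selected rows of a payload column,
   /// skipping (i.e. never "decoding") the rest.
   fn materialize_selected<T: Clone>(col: &[T], selection: &[bool]) -> Vec<T> {
       col.iter()
           .zip(selection)
           .filter(|(_, &keep)| keep)
           .map(|(v, _)| v.clone())
           .collect()
   }

   fn main() {
       let ids = vec![1_i64, 2, 3, 4, 5];
       let names = vec!["a", "b", "c", "d", "e"];

       // Step 1: decode the filter column and evaluate `id > 2` on it.
       let selection = evaluate_predicate(&ids, |v| v > 2);

       // Step 2: materialize only surviving rows of the other columns.
       let selected_names = materialize_selected(&names, &selection);
       assert_eq!(selected_names, vec!["c", "d", "e"]);
   }
   ```

   The engineering difficulty discussed below comes from doing step 2 efficiently on compressed, encoded data, where skipping rows is not free.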
   
   @XiangpengHao also gives a good background treatment in the context of 
adding a predicate cache (to avoid the overhead of decompressing pages twice): 
https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/
   
   However, it has taken us several years (and we are still not quite there) to 
get to the point where we can turn on late materialization "for real", due to 
various engineering challenges (e.g. decompression speed). 
   
   Interestingly, there is a similar discussion of filter representation in 
[Predicate Caching: Query-Driven Secondary Indexing for Cloud Data 
Warehouses](https://dl.acm.org/doi/10.1145/3626246.3653395) -- see sections 
`4.1.1 Range Index` and `4.1.2 Bitmap Index`
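   As a rough illustration of that tradeoff (hypothetical code, not the crate's actual `RowSelection` or bitmap internals): a range/run representation is compact and cheap to apply when matching rows are clustered, while a per-row bitmap has fixed size and supports cheap AND/OR between predicates. Converting one into the other is straightforward:

   ```rust
   use std::ops::Range;

   /// Convert a per-row boolean mask ("bitmap index") into a list of
   /// contiguous selected ranges ("range index"). The range form wins
   /// when matches are clustered (few long runs); the bitmap form wins
   /// for scattered matches and for cheap intersection of predicates.
   fn bitmap_to_ranges(mask: &[bool]) -> Vec<Range<usize>> {
       let mut ranges = Vec::new();
       let mut start = None;
       for (i, &keep) in mask.iter().enumerate() {
           match (keep, start) {
               (true, None) => start = Some(i), // open a new run
               (false, Some(s)) => {            // close the current run
                   ranges.push(s..i);
                   start = None;
               }
               _ => {}
           }
       }
       if let Some(s) = start {
           ranges.push(s..mask.len()); // final run extends to the end
       }
       ranges
   }

   fn main() {
       // A clustered selection: 6 mask entries collapse into 2 ranges.
       let mask = [true, true, false, false, true, true];
       let ranges = bitmap_to_ranges(&mask);
       assert_eq!(ranges, vec![0..2, 4..6]);
   }
   ```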
   
   **Describe the solution you'd like**
   
   I would like to write a blog post that highlights the tradeoffs in filter 
representation and how we worked to improve it. 
   
   
   
   **Describe alternatives you've considered**
   <!--
   A clear and concise description of any alternative solutions or features 
you've considered.
   -->
   
   **Additional context**
   <!--
   Add any other context or screenshots about the feature request here.
   -->
   