alamb commented on issue #8843: URL: https://github.com/apache/arrow-rs/issues/8843#issuecomment-3575896523
> I’d like to give a quick progress update: this week I mainly read the papers mentioned above and relevant `arrow-rs` code, and tried to draft the approach. I expect it will take a few more days before I can submit the first version.

Amazing!

> One question I’d like to ask/confirm: based on my current understanding, **the current implementation looks more like an LM (Late Materialization) pipeline rather than an EM (Early Materialization) pipeline**. Your text and ref image are both EM, but context is LM.

Yes, I am sorry -- I agree that what we have in the parquet reader is an early materialization pipeline.

> My understanding is: predicate columns are cached instead of being directly assembled into the tuple; they are read again later when needed, but because of the cache there’s no repeated decoding, so the full decoding process doesn’t have to be run again.

Yes, that is correct. Specifically, we (well @XiangpengHao) found that the cost of re-decoding (specifically decompressing with ZSTD or other block compression) often outweighed the benefits of late materialization.
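To illustrate the idea above, here is a minimal, hypothetical sketch (not the actual `arrow-rs` API or implementation): predicate columns are decoded once and kept in a cache, so when the same column is needed again for final materialization, the expensive decompress-and-decode step is skipped.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a cache of decoded predicate columns, keyed by
// column index. The names and types here are illustrative only.
struct DecodedColumnCache {
    cache: HashMap<usize, Vec<i64>>,
    decode_calls: usize, // counts how many times the expensive path ran
}

impl DecodedColumnCache {
    fn new() -> Self {
        Self { cache: HashMap::new(), decode_calls: 0 }
    }

    // Simulated expensive decode (stands in for ZSTD decompression plus
    // Parquet page decoding in the real reader).
    fn decode(&mut self, col: usize) -> Vec<i64> {
        self.decode_calls += 1;
        (0..4).map(|i| (col * 10 + i) as i64).collect()
    }

    // Return the decoded column, running the expensive decode at most
    // once per column; later lookups hit the cache.
    fn get(&mut self, col: usize) -> &Vec<i64> {
        if !self.cache.contains_key(&col) {
            let decoded = self.decode(col);
            self.cache.insert(col, decoded);
        }
        self.cache.get(&col).unwrap()
    }
}

fn main() {
    let mut cache = DecodedColumnCache::new();

    // Pass 1: evaluate a predicate on column 0 (decodes once).
    let keep: Vec<bool> = cache.get(0).iter().map(|v| v % 2 == 0).collect();

    // Pass 2: read column 0 again to assemble output rows -- cache hit,
    // no second decompression/decode.
    let _materialized = cache.get(0);

    assert_eq!(cache.decode_calls, 1);
    println!("decode_calls = {}, keep = {:?}", cache.decode_calls, keep);
}
```

The point of the sketch is the trade-off described above: re-reading a predicate column is cheap once the decoded values are cached, which is why full re-decoding (the costly part being block decompression) does not need to run again.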
