alamb commented on issue #8843: URL: https://github.com/apache/arrow-rs/issues/8843#issuecomment-3575896523
> I’d like to give a quick progress update: this week I mainly read the papers mentioned above and relevant `arrow-rs` code, and tried to draft the approach. I expect it will take a few more days before I can submit the first version.

Amazing!

> One question I’d like to ask/confirm: based on my current understanding, **the current implementation looks more like an LM (Late Materialization) pipeline rather than an EM (Early Materialization) pipeline**. Your text and ref image are both EM, but context is LM.

Yes, I am sorry -- I agree that what we have in the parquet reader is an early materialization pipeline.

> My understanding is: predicate columns are cached instead of being directly assembled into the tuple; they are read again later when needed, but because of the cache there’s no repeated decoding, so the full decoding process doesn’t have to be run again.

Yes, that is correct. Specifically, we (well @XiangpengHao) found that the cost of re-decoding (specifically decompressing with ZSTD or other block compression) often outweighed the benefits of late materialization.
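To illustrate the idea above, here is a minimal, hypothetical sketch (not the actual `arrow-rs` API or implementation): predicate columns are decoded once and kept in a cache, so when the same column is needed again for final materialization, the expensive decompress-and-decode step is skipped.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a cache of decoded predicate columns, keyed by
// column index. The names and types here are illustrative only.
struct DecodedColumnCache {
    cache: HashMap<usize, Vec<i64>>,
    decode_calls: usize, // counts how many times the expensive path ran
}

impl DecodedColumnCache {
    fn new() -> Self {
        Self { cache: HashMap::new(), decode_calls: 0 }
    }

    // Simulated expensive decode (stands in for ZSTD decompression plus
    // Parquet page decoding in the real reader).
    fn decode(&mut self, col: usize) -> Vec<i64> {
        self.decode_calls += 1;
        (0..4).map(|i| (col * 10 + i) as i64).collect()
    }

    // Return the decoded column, running the expensive decode at most
    // once per column; later lookups hit the cache.
    fn get(&mut self, col: usize) -> &Vec<i64> {
        if !self.cache.contains_key(&col) {
            let decoded = self.decode(col);
            self.cache.insert(col, decoded);
        }
        self.cache.get(&col).unwrap()
    }
}

fn main() {
    let mut cache = DecodedColumnCache::new();

    // Pass 1: evaluate a predicate on column 0 (decodes once).
    let keep: Vec<bool> = cache.get(0).iter().map(|v| v % 2 == 0).collect();

    // Pass 2: read column 0 again to assemble output rows -- cache hit,
    // no second decompression/decode.
    let _materialized = cache.get(0);

    assert_eq!(cache.decode_calls, 1);
    println!("decode_calls = {}, keep = {:?}", cache.decode_calls, keep);
}
```

The point of the sketch is the trade-off described above: re-reading a predicate column is cheap once the decoded values are cached, which is why full re-decoding (the costly part being block decompression) does not need to run again.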
