Re: [PR] [POC] Metadata index for Parquet files [arrow-rs]

via GitHub Tue, 28 Oct 2025 14:57:37 -0700


etseidl commented on PR #8714:
URL: https://github.com/apache/arrow-rs/pull/8714#issuecomment-3458679166


   > 🤔 I bet we would see a crazy speedup if we could also skip parsing
   > ColumnChunk metadata for columns that are not read in the query
   >
   > The benchmark above parses all the columns
   
   For sure. I did a quick test with 
https://github.com/apache/arrow-rs/pull/8714/commits/b3675628538be81cc30c2c9f6cfb381e0e08631f
 where I only read every other row group's metadata. The "wide" benchmark 
(which happily now includes the index, thanks again @lichuang!) went from 54s 
to 30s. I'd bet only decoding 10 out of 10000 columns would be crazy fast (I 
still have to do more plumbing before I can try that one).
   
   On a related note, if you (@alamb, but others welcome) could opine on #8643, 
I'd appreciate it. I'm having a hard time wrapping my head around how best to 
convey down to the thrift parsing code which bits of metadata are wanted. I get 
confused with multiple readers each with different options objects, that all 
then sort of use `ParquetMetaDataReader`, except now there's the push decoder 
and `MetadataParser`. For instance, how would one hook a column projection or 
pushdown predicate into the metadata parsing?
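
To make the projection question concrete, here is a minimal sketch of one possible shape: a single options object carries the set of wanted columns, and the decode loop consults it before doing any per-column work. Everything here is hypothetical and for illustration only. `MetadataDecodeOptions`, `EncodedColumnChunk`, `ColumnChunkMeta`, and `decode_row_group` are not arrow-rs APIs, and the "decoding" is a toy stand-in for the real thrift parsing.

```rust
use std::collections::HashSet;

/// Hypothetical options object (not an arrow-rs type).
#[derive(Debug, Default, Clone)]
pub struct MetadataDecodeOptions {
    /// Leaf column indices to decode; `None` means decode every column.
    pub column_projection: Option<HashSet<usize>>,
}

impl MetadataDecodeOptions {
    /// Returns true if metadata for column `idx` should be decoded.
    pub fn wants_column(&self, idx: usize) -> bool {
        self.column_projection
            .as_ref()
            .map_or(true, |projection| projection.contains(&idx))
    }
}

/// Toy stand-in for an encoded ColumnChunk: just a little-endian value count.
pub struct EncodedColumnChunk {
    pub num_values_le: [u8; 8],
}

/// Toy stand-in for decoded column chunk metadata.
#[derive(Debug, PartialEq)]
pub struct ColumnChunkMeta {
    pub column_index: usize,
    pub num_values: u64,
}

/// Toy decoder; a real one would parse the thrift-encoded struct.
fn decode_chunk(idx: usize, chunk: &EncodedColumnChunk) -> ColumnChunkMeta {
    ColumnChunkMeta {
        column_index: idx,
        num_values: u64::from_le_bytes(chunk.num_values_le),
    }
}

/// Decodes only the projected columns; skipped columns are left as `None`.
pub fn decode_row_group(
    chunks: &[EncodedColumnChunk],
    opts: &MetadataDecodeOptions,
) -> Vec<Option<ColumnChunkMeta>> {
    chunks
        .iter()
        .enumerate()
        .map(|(idx, chunk)| {
            if opts.wants_column(idx) {
                Some(decode_chunk(idx, chunk))
            } else {
                None
            }
        })
        .collect()
}
```

With the default options every slot comes back as `Some`; with a projection of columns 1 and 3, slots 0 and 2 stay `None`. Leaving skipped columns as `None` keeps indices aligned with the schema, which is one design choice among several; whether the real readers could share one such options object is exactly the open question above.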


