etseidl commented on PR #8714: URL: https://github.com/apache/arrow-rs/pull/8714#issuecomment-3458679166
> 🤔 I bet we would see a crazy speedup if we could also skip parsing ColumnChunk metadata for columns that are not read in the query
>
> The benchmark above parses all the columns

For sure. I did a quick test with https://github.com/apache/arrow-rs/pull/8714/commits/b3675628538be81cc30c2c9f6cfb381e0e08631f where I only read every other row group's metadata. The "wide" benchmark (which happily now includes the index, thanks again @lichuang!) went from 54s to 30s. I'd bet decoding only 10 out of 10000 columns would be crazy fast (I still have to do more plumbing before I can try that one).

On a related note, if you (@alamb, but others welcome) could opine on #8643, I'd appreciate it. I'm having a hard time wrapping my head around how best to convey down to the thrift parsing code which bits of metadata are wanted. I get confused by the multiple readers, each with its own options object, that all then sort of use `ParquetMetaDataReader`, except now there's the push decoder and `MetadataParser`. For instance, how would one hook a column projection or pushdown predicate into the metadata parsing?
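
To make the projection question concrete, here is a rough sketch of the shape such a hook could take. Everything in it is hypothetical: `MetadataProjection`, `ColumnChunkMeta`, `encode_row_group`, and `decode_row_group` are made-up names, not existing arrow-rs APIs. The byte layout is also a toy (each chunk is a one-byte length followed by a little-endian `u64`), chosen so that skipping is a simple cursor bump. Real Thrift structs are not length-prefixed, so an actual skip would still have to walk the fields, just without building the decoded structs.

```rust
/// Hypothetical stand-in for one column chunk's decoded metadata.
#[derive(Debug, PartialEq)]
pub struct ColumnChunkMeta {
    pub column_index: usize,
    pub num_values: u64,
}

/// Hypothetical options object: which leaf columns the caller wants decoded.
pub struct MetadataProjection {
    wanted: Vec<bool>,
}

impl MetadataProjection {
    /// Select only the listed column indices; out-of-range indices are ignored.
    pub fn only(num_columns: usize, columns: &[usize]) -> Self {
        let mut wanted = vec![false; num_columns];
        for &c in columns {
            if c < num_columns {
                wanted[c] = true;
            }
        }
        Self { wanted }
    }

    pub fn includes(&self, column_index: usize) -> bool {
        self.wanted.get(column_index).copied().unwrap_or(false)
    }
}

/// Toy encoder (NOT Thrift): one length byte, then a little-endian u64 per chunk.
pub fn encode_row_group(num_values: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    for n in num_values {
        out.push(8u8);
        out.extend_from_slice(&n.to_le_bytes());
    }
    out
}

/// Decode only the projected chunks; skipped chunks become `None` so that
/// positions in the result still line up with column indices.
/// Returns `None` if the input is truncated or malformed.
pub fn decode_row_group(
    raw: &[u8],
    projection: &MetadataProjection,
) -> Option<Vec<Option<ColumnChunkMeta>>> {
    let mut chunks = Vec::new();
    let mut pos = 0;
    let mut column_index = 0;
    while pos < raw.len() {
        let len = raw[pos] as usize;
        let payload = raw.get(pos + 1..pos + 1 + len)?;
        if projection.includes(column_index) {
            if payload.len() != 8 {
                return None;
            }
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(payload);
            chunks.push(Some(ColumnChunkMeta {
                column_index,
                num_values: u64::from_le_bytes(bytes),
            }));
        } else {
            // Not projected: advance past the payload without decoding it.
            chunks.push(None);
        }
        pos += 1 + len;
        column_index += 1;
    }
    Some(chunks)
}
```

Returning `Option` per column keeps the indices aligned with the schema, which is one possible answer to "how does the reader know what was skipped"; whether that fits the existing readers is exactly the open question in #8643.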
