alamb commented on issue #8643: URL: https://github.com/apache/arrow-rs/issues/8643#issuecomment-3467310457
> > Finally, another interesting question is if the ArrowReader should try and minimize the metadata decoding on its own.
> >
> > For example, if the reader is asked to read only 3 columns, and no other instruction is given for metadata, should it only decode the metadata for those three columns?
> >
> > I think the answer is yes....
>
> I think so as well.
>
> I need to look back at old discussions, but IIRC there was a suggestion to stash the footer bytes, and then materialize bits of metadata on demand. With an index this now becomes possible. That could solve the "how do we index this" question.
>
> Edit: it was [@XiangpengHao](https://github.com/XiangpengHao) in [#5855 (comment)](https://github.com/apache/arrow-rs/issues/5855#issuecomment-2154960257). With the index as part of the footer, the penalty when wanting to read the entire file goes away.

I think the code in arrow-rs / parquet-rs should just do the best with what it has and leave additional caching / optimization to other layers. For example, in DataFusion, the code already caches the entire `ParquetMetaData` (including the column index) and passes it into the arrow-rs code for all columns in many cases, so adding additional caching in the parquet reader itself seems unnecessary.

What I think would help is APIs for progressively reading / populating the metadata (e.g. initially only read 5 columns, but then be able to incrementally parse / produce the remaining columns afterwards) -- perhaps APIs on `ParquetMetaData` to add new columns / row groups.
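To illustrate the "stash the footer bytes and materialize on demand" idea, here is a minimal sketch in plain Rust. Everything in it is hypothetical (the `LazyColumnMetadata` type, the byte-range index, and the `column` method are not part of the actual parquet-rs API); real decoding would parse Thrift-encoded column chunk metadata rather than slice raw bytes, but the on-demand, incrementally-populated shape is the point:

```rust
use std::collections::HashMap;

/// Hypothetical sketch (not the real parquet-rs API): keep the raw footer
/// bytes plus a per-column byte-range index, and materialize each column's
/// metadata only when it is first requested.
struct LazyColumnMetadata {
    /// Raw, undecoded footer bytes stashed at open time.
    footer: Vec<u8>,
    /// Byte range within `footer` for each column's metadata.
    index: HashMap<String, (usize, usize)>,
    /// Columns decoded so far, populated incrementally on demand.
    decoded: HashMap<String, Vec<u8>>,
}

impl LazyColumnMetadata {
    fn new(footer: Vec<u8>, index: HashMap<String, (usize, usize)>) -> Self {
        Self { footer, index, decoded: HashMap::new() }
    }

    /// "Decode" (here: just slice) one column's metadata on demand and
    /// cache it, so repeated reads do no further work.
    fn column(&mut self, name: &str) -> Option<&Vec<u8>> {
        if !self.decoded.contains_key(name) {
            let &(start, end) = self.index.get(name)?;
            let bytes = self.footer[start..end].to_vec();
            self.decoded.insert(name.to_string(), bytes);
        }
        self.decoded.get(name)
    }

    /// How many columns have actually been materialized so far.
    fn decoded_count(&self) -> usize {
        self.decoded.len()
    }
}

fn main() {
    let footer = b"aaabbbccc".to_vec();
    let index = HashMap::from([
        ("a".to_string(), (0, 3)),
        ("b".to_string(), (3, 6)),
        ("c".to_string(), (6, 9)),
    ]);
    let mut meta = LazyColumnMetadata::new(footer, index);

    // Reading only one column decodes only that column.
    assert_eq!(meta.column("b").unwrap(), b"bbb");
    assert_eq!(meta.decoded_count(), 1);

    // The remaining columns can be parsed / produced later.
    assert_eq!(meta.column("a").unwrap(), b"aaa");
    assert_eq!(meta.decoded_count(), 2);
}
```

With the index stored as part of the footer, a reader asked for 3 of 1000 columns pays only for those three ranges, while a reader that eventually touches every column converges to the cost of decoding the whole footer once.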
