alamb commented on issue #8643:
URL: https://github.com/apache/arrow-rs/issues/8643#issuecomment-3467310457

   > > Finally, another interesting question is if the ArrowReader should try 
and minimize the metadata decoding on its own.
   > > For example, if the reader is asked to read only 3 columns, and no other 
instruction is given for metadata, should it only decode the metadata for those 
three columns?
   > > I think the answer is yes....
   > 
   > I think so as well.
   > 
   > I need to look back at old discussions, but IIRC there was a suggestion to 
stash the footer bytes, and then materialize bits of metadata on demand. With 
an index this now becomes possible. That could solve the "how do we index this" 
question.
   > 
   > Edit: it was [@XiangpengHao](https://github.com/XiangpengHao) [#5855 
(comment)](https://github.com/apache/arrow-rs/issues/5855#issuecomment-2154960257).
 With the index as part of the footer, the penalty when wanting to read the 
entire file goes away.
   
   I think the code in arrow-rs / parquet-rs should just do the best with what 
it has and leave additional caching / optimization to other layers.
   
   For example, DataFusion already caches the entire ParquetMetaData 
(including the column index) and passes it into the arrow-rs code for all 
columns in many cases, so adding additional caching in the parquet reader 
itself seems unnecessary.
   
   What I think would help is APIs for progressively reading / populating the 
metadata (e.g. initially decode only 5 columns, but then be able to 
incrementally parse / populate the remaining columns afterwards) -- perhaps 
as APIs on ParquetMetaData to add new columns / row groups.
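   To make the "progressively populate" idea concrete, here is a toy sketch. 
This is not the actual parquet-rs API -- the `LazyParquetMetaData` type, its 
fields, and its methods are all hypothetical, and real column metadata is 
Thrift-encoded rather than the plain byte slices simulated here. The point is 
only the shape of the API: stash the raw footer bytes plus an index of 
per-column byte ranges, decode only the projected columns up front, and fill 
in the rest on demand.

   ```rust
   use std::collections::HashMap;

   /// Hypothetical sketch: metadata that keeps the raw footer bytes and
   /// decodes per-column metadata on demand.
   struct LazyParquetMetaData {
       /// Raw (undecoded) footer bytes, stashed at open time.
       footer: Vec<u8>,
       /// Byte range of each column's metadata within `footer`, e.g.
       /// recovered from an index stored alongside the footer.
       column_offsets: Vec<(usize, usize)>,
       /// Columns materialized so far, keyed by column ordinal.
       decoded: HashMap<usize, String>,
   }

   impl LazyParquetMetaData {
       /// Decode metadata for `column` if not already present, and return it.
       /// (Decoding is simulated as a UTF-8 slice of the footer bytes.)
       fn column(&mut self, column: usize) -> &str {
           let (start, end) = self.column_offsets[column];
           self.decoded.entry(column).or_insert_with(|| {
               String::from_utf8_lossy(&self.footer[start..end]).into_owned()
           })
       }

       /// How many columns have been materialized so far.
       fn decoded_len(&self) -> usize {
           self.decoded.len()
       }
   }

   fn main() {
       // A footer containing three "column metadata" payloads back to back.
       let footer = b"col-a-meta;col-b-meta;col-c-meta".to_vec();
       let mut meta = LazyParquetMetaData {
           footer,
           column_offsets: vec![(0, 10), (11, 21), (22, 32)],
           decoded: HashMap::new(),
       };

       // Reading a projection of one column decodes only that column ...
       assert_eq!(meta.column(1), "col-b-meta");
       assert_eq!(meta.decoded_len(), 1);

       // ... and the remaining columns can be populated incrementally later.
       meta.column(0);
       meta.column(2);
       assert_eq!(meta.decoded_len(), 3);
       println!("decoded {} columns", meta.decoded_len());
   }
   ```

   With an index in the footer, the `column_offsets` table comes for free, so 
reading the whole file up front pays no extra penalty over eager decoding.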


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]