Re: [I] API for encoding/decoding ParquetMetadata with more control [arrow-rs]

via GitHub Sat, 13 Jul 2024 04:02:29 -0700


alamb commented on issue #6002:
URL: https://github.com/apache/arrow-rs/issues/6002#issuecomment-2226856342


   > I got my thing working, but it seems quite brittle. TLDR is that I'm just 
tracking what bytes DataFusion reads and then slicing to those. Which seems 
like it could be quite inefficient and might break if DataFusion changes 
internal details.
   
   Good to hear you got it working. Yes, I agree that working out a more 
flexible and more efficient API would be ideal.
   
   As I think you are hinting at, `MetadataLoader` was designed around the 
exact needs of the parquet reader, so it is not easy to use outside of it.
   
   Maybe a good place to start would be to write tests / examples of what you 
are trying to do. For example:
   
   1.  Read and decode metadata from a parquet footer
   * with/without offset index
   * with/without bloom filters
   * when the initial pre-fetch didn't include the bytes for the FileMetadata
   * when the initial pre-fetch didn't include the bytes for some of the 
out-of-line structures (offset index, bloom filters)
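   
   To make the "pre-fetch didn't include the bytes" cases concrete, here is a 
rough std-only sketch (not the parquet crate's API) of the check such a test 
would exercise. It relies only on the file format's trailer: a 4-byte 
little-endian metadata length followed by the magic `PAR1`. The names 
`Prefetch` and `inspect_suffix` are hypothetical, and encrypted footers (magic 
`PARE`) are not handled.

```rust
use std::ops::Range;

/// Trailer size: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical result type for this sketch (not part of the parquet crate).
#[derive(Debug, PartialEq)]
enum Prefetch {
    /// The pre-fetched suffix already contains the whole FileMetaData;
    /// the range indexes into the suffix slice.
    Complete(Range<usize>),
    /// The suffix is too short; this many extra bytes immediately
    /// preceding the suffix must still be fetched.
    NeedMore(usize),
}

/// Inspect the last bytes of a parquet file (`suffix`) and report whether
/// the encoded FileMetaData is fully contained in them.
fn inspect_suffix(suffix: &[u8]) -> Result<Prefetch, String> {
    if suffix.len() < FOOTER_SIZE {
        return Err(format!("need at least {FOOTER_SIZE} bytes, got {}", suffix.len()));
    }
    let footer = &suffix[suffix.len() - FOOTER_SIZE..];
    if footer[4..] != MAGIC[..] {
        return Err("missing PAR1 magic at end of file".to_string());
    }
    let metadata_len =
        u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as usize;

    // Bytes in the suffix that precede the 8-byte trailer.
    let available = suffix.len() - FOOTER_SIZE;
    if metadata_len <= available {
        Ok(Prefetch::Complete(available - metadata_len..available))
    } else {
        Ok(Prefetch::NeedMore(metadata_len - available))
    }
}
```

   A test for the "pre-fetch too small" case would pass a short suffix, expect 
`NeedMore(n)`, fetch `n` more bytes, and then decode. The out-of-line 
structures (offset index, bloom filters) would need a similar check against the 
offsets recorded in the decoded metadata, which this sketch does not attempt.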
   
   Also, are you trying to support the case where you already have bytes in 
memory and want to decode parquet metadata from them?


