adriangb commented on issue #6002:
URL: https://github.com/apache/arrow-rs/issues/6002#issuecomment-2227009013
> Also are you trying to support when you have bytes in memory that you want to decode parquet metadata from?

Yes, exactly. But to get those bytes in memory I also have to write them somehow. The big-picture use case is that I have a `Vec<RecordBatch>` in memory that I want to write out to a Parquet file in an object store. I also want to save metadata (in the general sense) about this new file to a commit log / secondary index. This metadata store has file paths, partitioning information, file sizes, creation dates, row group statistics, and also the Parquet metadata. The point is that I can then take a query and push down as much as I can into this metadata store, returning everything I need to start reading files from object storage while minimizing slow object storage IO. If I store the Parquet metadata there as well, then a single query to the metadata store gets me everything I need to start reading chunks of actual data from object storage.

Currently I'm writing the `Vec<RecordBatch>` to a `Bytes` (maybe in the future I'll want to write directly to object storage, but that's a problem for another day), then using something like what's described in https://github.com/apache/arrow-rs/issues/6002#issuecomment-2221000971 to extract just the metadata from those bytes. Having a metadata writer, as I'm trying to do in #6000, would make this a _bit_ less hacky because I could load the `ParquetMetaData` from the in-memory bytes of the entire file (there are various APIs already available for this, e.g. `MetadataLoader`) instead of doing the trick of tracking which bytes are being read.

Thinking about it more, I don't think we need a new metadata loader. There are various places where the metadata references byte ranges or offsets relative to the entire file (e.g. the column index offsets), so there's always going to be a bit of friction when trying to load metadata without the rest of the file. Maybe this is an indication that I'm abusing the metadata and should instead be building a completely parallel structure, but practically that's unjustifiable: it adds complexity and more conversions to load/dump when we already have a good serialization format.
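For concreteness, here is a minimal sketch of the write-and-slice path just described, assuming the `parquet`, `arrow-array`, and `bytes` crates. Both helper names are hypothetical. The slicing relies on the standard Parquet footer layout (the last 8 bytes are a 4-byte little-endian metadata length followed by the `PAR1` magic), and note that the resulting suffix covers only the footer metadata, not the page indexes that precede it in the file:

```rust
use arrow_array::RecordBatch;
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result as ParquetResult;

/// Hypothetical helper: write batches to an in-memory Parquet file.
/// Assumes `batches` is non-empty and all batches share one schema.
fn write_batches_to_bytes(batches: &[RecordBatch]) -> ParquetResult<Bytes> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batches[0].schema(), None)?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?; // writes the footer, including the serialized metadata
    Ok(Bytes::from(buffer))
}

/// Hypothetical helper: slice the serialized footer metadata out of a full file.
/// The metadata starts at `len - 8 - metadata_len`, where `metadata_len` is the
/// 4-byte little-endian length stored just before the trailing `PAR1` magic.
/// This excludes the page indexes, which are stored *before* the metadata and
/// are referenced by absolute file offsets.
fn metadata_suffix(full_file: &Bytes) -> Bytes {
    let n = full_file.len();
    let metadata_len =
        u32::from_le_bytes(full_file[n - 8..n - 4].try_into().unwrap()) as usize;
    full_file.slice(n - 8 - metadata_len..)
}
```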
In any case, I think a simplified version of https://github.com/apache/arrow-rs/issues/6002#issuecomment-2221000971 for reading would be okay:

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use futures::FutureExt;
use parquet::arrow::async_reader::{MetadataFetch, MetadataLoader};
use parquet::errors::{ParquetError, Result as ParquetResult};
use parquet::file::metadata::ParquetMetaData;

/// Serves fetches against only the trailing bytes of a Parquet file,
/// rejecting any read that falls outside the stored suffix.
#[derive(Debug, Clone)]
struct AsyncBytes {
    file_size: usize,
    data_suffix: Bytes,
}

impl AsyncBytes {
    fn new(file_size: usize, data_suffix: Bytes) -> Self {
        Self {
            file_size,
            data_suffix,
        }
    }
}

impl MetadataFetch for &mut AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // The suffix covers the last `data_suffix.len()` bytes of the file
        let available_range = self.file_size - self.data_suffix.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={available_range:?}"
                );
                Err(ParquetError::General(err))
            }
            .boxed();
        }
        // Adjust the range to be relative to the start of the suffix
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.data_suffix.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}

pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let mut reader = AsyncBytes::new(file_size, serialized_parquet_metadata);
    let loader = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = loader.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}
```

I don't know if you feel this code is worth committing to the project; I'm happy to just use it myself until someone comes along with another use case for loading `ParquetMetaData` from just the metadata bytes.
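To sanity-check the sketch above, here is a hypothetical round-trip using the helpers from the earlier write-side sketch. It passes the entire file as the suffix so that every fetch, footer and page indexes alike, falls inside `AsyncBytes`'s available range; with a real metadata store you would instead persist only the tail of the file that covers the page indexes and footer:

```rust
// Hypothetical round-trip: write batches to memory, then reload the metadata
// through load_metadata using the whole file as the suffix.
async fn round_trip(batches: Vec<RecordBatch>) -> ParquetResult<()> {
    let full_file = write_batches_to_bytes(&batches)?;
    let metadata = load_metadata(full_file.len(), full_file).await?;
    println!("row groups: {}", metadata.num_row_groups());
    Ok(())
}
```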
