adriangb commented on issue #6002:
URL: https://github.com/apache/arrow-rs/issues/6002#issuecomment-2227009013
> Also are you trying to support when you have bytes in memory that you want to decode parquet metadata from?

Yes, exactly. But to get those bytes in memory I also have to write them somehow. The big-picture use case is that I have a `Vec<RecordBatch>` in memory that I want to write out to a Parquet file in an object store. I also want to save metadata (in the general sense) about this new file to a commit log / secondary index. This metadata store has file paths, partitioning information, file sizes, creation dates, row group statistics, and also the Parquet metadata. The point is that I can then take a query and push down as much as I can into this metadata store, returning everything I need to start reading files from object storage while minimizing slow object storage IO. If I store the Parquet metadata there as well, then a single query to the metadata store gets me everything I need to start reading chunks of actual data from object storage.

Currently I'm writing the `Vec<RecordBatch>` to a `Bytes` (maybe in the future I'll want to write directly to object storage, but that's a problem for another day), then using something like what's described in https://github.com/apache/arrow-rs/issues/6002#issuecomment-2221000971 to extract just the metadata from those bytes. Having a metadata writer, as I'm trying to do in #6000, would make this a _bit_ less hacky because I could load the `ParquetMetaData` from the in-memory bytes of the entire file (there are various APIs already available for this, e.g. `MetadataLoader`) instead of doing the trick of tracking which bytes are being read.

Thinking about it more, I don't think we need a new metadata loader. There are various places where the metadata references byte ranges or offsets relative to the entire file (e.g. the column index offsets), so there's always going to be a bit of friction when trying to load metadata without the rest of the file. Maybe this is an indication that I'm abusing the metadata and should instead be building a completely parallel structure, but practically that's unjustifiable: it adds complexity and more conversions to load/dump when we already have a good serialization format.
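For concreteness, here is a minimal sketch of the write-and-slice path just described, assuming the `parquet`, `arrow-array`, and `bytes` crates. Both helper names are hypothetical. The slicing relies on the standard Parquet footer layout (the last 8 bytes are a 4-byte little-endian metadata length followed by the `PAR1` magic), and note that the resulting suffix covers only the footer metadata, not the page indexes that precede it in the file:

```rust
use arrow_array::RecordBatch;
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result as ParquetResult;

/// Hypothetical helper: write batches to an in-memory Parquet file.
/// Assumes `batches` is non-empty and all batches share one schema.
fn write_batches_to_bytes(batches: &[RecordBatch]) -> ParquetResult<Bytes> {
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batches[0].schema(), None)?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?; // writes the footer, including the serialized metadata
    Ok(Bytes::from(buffer))
}

/// Hypothetical helper: slice the serialized footer metadata out of a full file.
/// The metadata starts at `len - 8 - metadata_len`, where `metadata_len` is the
/// 4-byte little-endian length stored just before the trailing `PAR1` magic.
/// This excludes the page indexes, which are stored *before* the metadata and
/// are referenced by absolute file offsets.
fn metadata_suffix(full_file: &Bytes) -> Bytes {
    let n = full_file.len();
    let metadata_len =
        u32::from_le_bytes(full_file[n - 8..n - 4].try_into().unwrap()) as usize;
    full_file.slice(n - 8 - metadata_len..)
}
```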
In any case, I think a simplified version of https://github.com/apache/arrow-rs/issues/6002#issuecomment-2221000971 for reading would be okay:

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use futures::FutureExt;
use parquet::arrow::async_reader::{MetadataFetch, MetadataLoader};
use parquet::errors::{ParquetError, Result as ParquetResult};
use parquet::file::metadata::ParquetMetaData;

/// Serves fetches against only the trailing bytes of a Parquet file,
/// rejecting any read that falls outside the stored suffix.
#[derive(Debug, Clone)]
struct AsyncBytes {
    file_size: usize,
    data_suffix: Bytes,
}

impl AsyncBytes {
    fn new(file_size: usize, data_suffix: Bytes) -> Self {
        Self {
            file_size,
            data_suffix,
        }
    }
}

impl MetadataFetch for &mut AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // The suffix covers the last `data_suffix.len()` bytes of the file
        let available_range = self.file_size - self.data_suffix.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={available_range:?}"
                );
                Err(ParquetError::General(err))
            }
            .boxed();
        }
        // Adjust the range to be relative to the start of the suffix
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.data_suffix.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}

pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let mut reader = AsyncBytes::new(file_size, serialized_parquet_metadata);
    let loader = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = loader.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}
```

I don't know if you feel this code is worth committing to the project; I'm happy to just use it myself until someone comes along with another use case for loading `ParquetMetaData` from just the metadata bytes.
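To sanity-check the sketch above, here is a hypothetical round-trip using the helpers from the earlier write-side sketch. It passes the entire file as the suffix so that every fetch, footer and page indexes alike, falls inside `AsyncBytes`'s available range; with a real metadata store you would instead persist only the tail of the file that covers the page indexes and footer:

```rust
// Hypothetical round-trip: write batches to memory, then reload the metadata
// through load_metadata using the whole file as the suffix.
async fn round_trip(batches: Vec<RecordBatch>) -> ParquetResult<()> {
    let full_file = write_batches_to_bytes(&batches)?;
    let metadata = load_metadata(full_file.len(), full_file).await?;
    println!("row groups: {}", metadata.num_row_groups());
    Ok(())
}
```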
