[I] Way to share `SchemaDescriptorPtr` across `ParquetMetadata` objects [arrow-rs]

via GitHub Wed, 03 Jul 2024 09:10:03 -0700


alamb opened a new issue, #5999:
URL: https://github.com/apache/arrow-rs/issues/5999


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   In low-latency, parquet-based query applications, it is important to be able 
to cache / reuse the `ParquetMetaData` from parquet files (supplying it via 
[ArrowReaderBuilder::new_with_metadata](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.new_with_metadata)
 instead of re-reading / parsing it from the parquet footer while reading the 
parquet data).
   
   For many such systems (including InfluxDB 3.0), many of the files have the 
same schema, so storing the same schema information for each parquet file is 
wasteful.
   
   
   
   **Describe the solution you'd like**
   I would like a way to share the `SchemaDescPtr`. The schema is already 
wrapped in an `Arc`, so it is likely possible to avoid storing the same 
schema over and over again.
   
   https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#197 . 
   
   
   **Describe alternatives you've considered**
   
   Perhaps we could add an API like `with_schema` to `ParquetMetaData`:
   
   ```rust
   impl ParquetMetaData { 
   ... 
     /// Set the internal schema pointers
     fn with_schema(self, schema_descr: SchemaDescPtr) -> Self {
      ..
     }
   ...
   }
   ```
   
   It could be used like this:
   
   ```rust
   let mut metadata: ParquetMetaData = ... // load metadata from a parquet file
   // Check if we already have the same schema loaded
   if let Some(existing_schema) = find_existing_schema(&catalog, &metadata) {
     // if so, use the existing schema
     metadata = metadata.with_schema(existing_schema);
   }
   ```
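
To make the intended flow concrete, here is a minimal, std-only sketch of the deduplication idea. It does not use the parquet crate: `Schema`, `Metadata`, and `SchemaCatalog` are hypothetical stand-ins for `SchemaDescriptor`, `ParquetMetaData`, and whatever cache the application keeps, and `intern` plays the role of `find_existing_schema` plus the proposed `with_schema` call.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for parquet's `SchemaDescriptor`.
#[derive(Debug, PartialEq, Eq, Hash)]
struct Schema {
    columns: Vec<String>,
}

/// Hypothetical stand-in for `ParquetMetaData` (one schema pointer only).
struct Metadata {
    schema: Arc<Schema>,
    num_rows: i64,
}

impl Metadata {
    /// Mirrors the proposed `with_schema`: swap in a shared schema pointer.
    fn with_schema(mut self, schema: Arc<Schema>) -> Self {
        self.schema = schema;
        self
    }
}

/// Hypothetical catalog that keeps one `Arc` per distinct schema.
#[derive(Default)]
struct SchemaCatalog {
    schemas: HashSet<Arc<Schema>>,
}

impl SchemaCatalog {
    /// Returns `metadata` pointing at the catalog's copy of its schema,
    /// registering the schema if it has not been seen before.
    fn intern(&mut self, metadata: Metadata) -> Metadata {
        // `Arc<Schema>` hashes and compares by the schema's contents,
        // so an equal schema from another file is found here.
        if let Some(existing) = self.schemas.get(&metadata.schema) {
            let existing = Arc::clone(existing);
            // The freshly parsed schema is dropped by this replacement.
            return metadata.with_schema(existing);
        }
        self.schemas.insert(Arc::clone(&metadata.schema));
        metadata
    }
}
```

With this shape, two files with equal schemas end up sharing one allocation, which can be checked with `Arc::ptr_eq` on their `schema` fields.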
   
   
   **Additional context**
   
   This infrastructure is a natural follow-on to 
https://github.com/apache/arrow-rs/issues/1729 to track the memory used.
   
   This API would likely be tricky to implement, given there are several 
references to the schema in `ParquetMetaData` child fields (e.g. 
https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#299).
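
The difficulty can be illustrated with a small std-only sketch. The types below are hypothetical stand-ins, not the parquet crate's definitions; the only assumption taken from the issue is that child structures hold their own clones of the schema pointer. Under that assumption, `with_schema` has to rewrite every child as well, otherwise the original schema allocation stays alive and nothing is saved.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for parquet's `SchemaDescriptor`.
struct Schema {
    columns: Vec<String>,
}

/// Hypothetical child structure holding its own schema pointer.
struct RowGroup {
    schema: Arc<Schema>,
    num_rows: i64,
}

/// Hypothetical stand-in for `ParquetMetaData`.
struct Metadata {
    schema: Arc<Schema>,
    row_groups: Vec<RowGroup>,
}

impl Metadata {
    /// Replace the schema pointer at the top level and in every child,
    /// so that no reference to the previous allocation remains.
    fn with_schema(mut self, schema: Arc<Schema>) -> Self {
        for row_group in &mut self.row_groups {
            row_group.schema = Arc::clone(&schema);
        }
        self.schema = schema;
        self
    }
}
```

A `Weak` handle taken on the original schema before calling `with_schema` is one way to confirm that the old allocation is really freed afterwards.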
   

