jecsand838 commented on code in PR #8006: URL: https://github.com/apache/arrow-rs/pull/8006#discussion_r2252820147
########## arrow-avro/src/reader/mod.rs: ########## @@ -124,23 +132,26 @@ fn read_header<R: BufRead>(mut reader: R) -> Result<Header, ArrowError> { /// A low-level interface for decoding Avro-encoded bytes into Arrow `RecordBatch`. #[derive(Debug)] pub struct Decoder { - record_decoder: RecordDecoder, + active_decoder: RecordDecoder, + active_fingerprint: Option<Fingerprint>, batch_size: usize, - decoded_rows: usize, + remaining_capacity: usize, + #[cfg(feature = "lru")] + cache: LruCache<Fingerprint, RecordDecoder>, + #[cfg(not(feature = "lru"))] + cache: IndexMap<Fingerprint, RecordDecoder>, + max_cache_size: usize, + reader_schema: Option<AvroSchema<'static>>, + writer_schema_store: Option<SchemaStore<'static>>, Review Comment: @scovich You're right. Here's where my research led me: Requiring `'static` lifetimes for Avro schemas means either restricting schemas to static data or leaking heap memory for dynamic schema components. One thing I noticed is that the `HeaderDecoder` seems able to get around this by parsing a json string from the first row of an Avro file during the `ReaderBuilder::build` process, i.e. ```rust pub fn schema(&self) -> Result<Option<Schema<'_>>, ArrowError> { self.get(SCHEMA_METADATA_KEY) .map(|x| { serde_json::from_slice(x).map_err(|e| { ArrowError::ParseError(format!("Failed to parse Avro schema JSON: {e}")) }) }) .transpose() } ``` Do you think an approach that uses JSON Strings instead of `Schema`, for instance: ```rust /// Sets the Avro reader schema. /// /// If a schema is not provided, the schema will be read from the Avro file header. pub fn with_reader_schema(mut self, reader_schema: String) -> Self { ... } /// Sets the `SchemaStore` used for resolving writer schemas. /// /// This is necessary when decoding single-object encoded data that identifies /// schemas by a fingerprint. The store allows the decoder to look up the /// full writer schema from a fingerprint embedded in the data. /// /// Defaults to `None`. pub fn with_writer_schema_store(mut self, store: HashMap<Fingerprint, String>) -> Self { ... } ``` Then parses the each `Schema` inside the `build` and `build_decoder` methods would be a way around this? UPDATE: Maybe we could also pre-build the `cache: IndexMap<Fingerprint, RecordDecoder>,` from JSON strings in the `ReaderBuilder::build_decoder` method and then pass that into the new `Decoder` so we can remove: ```rust reader_schema: Option<AvroSchema<'static>>, writer_schema_store: Option<SchemaStore<'static>>, ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org