Re: [PR] Implement arrow-avro SchemaStore and Fingerprinting To Enable Schema Resolution [arrow-rs]

via GitHub Mon, 04 Aug 2025 17:28:13 -0700


jecsand838 commented on code in PR #8006:
URL: https://github.com/apache/arrow-rs/pull/8006#discussion_r2252820147



##########
arrow-avro/src/reader/mod.rs:
##########
@@ -124,23 +132,26 @@ fn read_header<R: BufRead>(mut reader: R) -> 
Result<Header, ArrowError> {
 /// A low-level interface for decoding Avro-encoded bytes into Arrow 
`RecordBatch`.
 #[derive(Debug)]
 pub struct Decoder {
-    record_decoder: RecordDecoder,
+    active_decoder: RecordDecoder,
+    active_fingerprint: Option<Fingerprint>,
     batch_size: usize,
-    decoded_rows: usize,
+    remaining_capacity: usize,
+    #[cfg(feature = "lru")]
+    cache: LruCache<Fingerprint, RecordDecoder>,
+    #[cfg(not(feature = "lru"))]
+    cache: IndexMap<Fingerprint, RecordDecoder>,
+    max_cache_size: usize,
+    reader_schema: Option<AvroSchema<'static>>,
+    writer_schema_store: Option<SchemaStore<'static>>,

Review Comment:
   @scovich You're right. Here's where my research led me: Requiring `'static` 
lifetimes for Avro schemas means either restricting schemas to static data or 
leaking heap memory for dynamic schema components.
   
   One thing I noticed is that the `HeaderDecoder` seems able to get around 
this by parsing a json string from the first row of an Avro file during the 
`ReaderBuilder::build` process, i.e.
   
   ```rust
       pub fn schema(&self) -> Result<Option<Schema<'_>>, ArrowError> {
           self.get(SCHEMA_METADATA_KEY)
               .map(|x| {
                   serde_json::from_slice(x).map_err(|e| {
                       ArrowError::ParseError(format!("Failed to parse Avro 
schema JSON: {e}"))
                   })
               })
               .transpose()
       }
   ```
   
   Do you think an approach that uses JSON Strings instead of `Schema`, for 
instance:
   
   ```rust
       /// Sets the Avro reader schema.
       ///
       /// If a schema is not provided, the schema will be read from the Avro 
file header.
       pub fn with_reader_schema(mut self, reader_schema: String) -> Self {
           ...
       }
   
       /// Sets the `SchemaStore` used for resolving writer schemas.
       ///
       /// This is necessary when decoding single-object encoded data that 
identifies
       /// schemas by a fingerprint. The store allows the decoder to look up the
       /// full writer schema from a fingerprint embedded in the data.
       ///
       /// Defaults to `None`.
       pub fn with_writer_schema_store(mut self, store: HashMap<Fingerprint, 
String>) -> Self {
           ...
       }
   ```
   
   Then parses the each `Schema` inside the `build` and `build_decoder` methods 
would be a way around this?
   
   UPDATE:
   
   Maybe we could also pre-build the `cache: IndexMap<Fingerprint, 
RecordDecoder>,` from JSON strings in the `ReaderBuilder::build_decoder` method 
and then pass that into the new `Decoder` so we can remove:
   
   ```rust
       reader_schema: Option<AvroSchema<'static>>,
       writer_schema_store: Option<SchemaStore<'static>>,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Implement arrow-avro SchemaStore and Fingerprinting To Enable Schema Resolution [arrow-rs]

Reply via email to