jecsand838 opened a new pull request, #8039:
URL: https://github.com/apache/arrow-rs/pull/8039

   # Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/issues/4886
   
   - Pre-work for https://github.com/apache/arrow-rs/pull/8006
   
   # Rationale for this change
   
   Apache Avro’s [single object encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding) prefixes every record with the marker `0xC3 0x01` followed by a `Rabin` [schema fingerprint](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints) so that readers can identify the correct writer schema without carrying the full definition in each message.
   While the current `arrow‑avro` implementation can read container files, it 
cannot ingest these framed messages or handle streams where the writer schema 
changes over time.
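
For illustration, the framing described above can be sketched as follows. This is a minimal sketch, not the code in this PR: `MAGIC` and `split_single_object` are hypothetical names, and the layout (2-byte marker, then an 8-byte little-endian fingerprint, then the Avro binary body) follows the Avro specification.

```rust
/// Two-byte marker that starts every single-object encoded Avro message.
/// (Hypothetical name; the PR exposes its own `SINGLE_OBJECT_MAGIC`.)
const MAGIC: [u8; 2] = [0xC3, 0x01];

/// Splits a framed message into `(fingerprint, body)`.
/// Returns `None` if the buffer is too short or does not start with the marker.
fn split_single_object(buf: &[u8]) -> Option<(u64, &[u8])> {
    // Header = 2-byte marker + 8-byte fingerprint.
    if buf.len() < 10 || !buf.starts_with(&MAGIC) {
        return None;
    }
    let mut fp = [0u8; 8];
    fp.copy_from_slice(&buf[2..10]);
    // The spec stores the fingerprint in little-endian byte order.
    Some((u64::from_le_bytes(fp), &buf[10..]))
}
```

The returned body borrows from the input buffer, so identifying the schema does not copy the payload.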
   
   The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin) fingerprint of the [parsed canonical form of a schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas) and using it to look up the `Schema` in a local schema store or registry.
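
As a rough sketch of that fingerprint, the snippet below ports the table-driven reference algorithm from the Avro specification to Rust. The names `EMPTY`, `fp_table`, and `rabin_fingerprint` are illustrative and are not taken from this PR; the input is assumed to be the UTF-8 bytes of a schema already in parsing canonical form.

```rust
/// Seed value ("empty" fingerprint) defined by the Avro specification.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table used by CRC-64-AVRO.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for i in 0..256 {
        let mut fp = i as u64;
        for _ in 0..8 {
            // XOR in the polynomial only when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        table[i] = fp;
    }
    table
}

/// Computes the 64-bit Rabin fingerprint of `data`.
/// The table is rebuilt on every call for brevity; real code would cache it.
fn rabin_fingerprint(data: &[u8]) -> u64 {
    let table = fp_table();
    let mut fp = EMPTY;
    for &b in data {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}
```

A call such as `rabin_fingerprint(br#""null""#)` yields the value that would be written, in little-endian order, after the `0xC3 0x01` marker. Results are best validated against the example fingerprints published with Avro.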
   
   This PR introduces **`SchemaStore`** and **fingerprinting** to enable:
   
   * **Zero‑copy schema identification** for decoding streaming Avro messages published in single‑object format (e.g. Kafka, Pulsar, etc.) into Arrow.
   * **Dynamic schema evolution** by laying the foundation to resolve writer/reader schema differences on the fly.
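
To make the idea of a fingerprint-keyed store concrete, here is a deliberately simplified sketch. `ToySchemaStore` and its methods are hypothetical and do not reflect the actual `SchemaStore` API added in this PR; schemas are held as canonical-form strings, and the caller is assumed to supply a precomputed fingerprint.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a schema store (not the PR's API).
#[derive(Default)]
struct ToySchemaStore {
    schemas: HashMap<u64, String>,
}

impl ToySchemaStore {
    /// Registers a canonical-form schema under its fingerprint.
    /// Re-registering the same schema is a no-op; a different schema under
    /// the same fingerprint is reported as a collision (a choice made for
    /// this sketch only).
    fn register(&mut self, fingerprint: u64, canonical: &str) -> Result<u64, String> {
        match self.schemas.get(&fingerprint) {
            Some(existing) if existing.as_str() == canonical => Ok(fingerprint),
            Some(_) => Err(format!("fingerprint collision for {:#x}", fingerprint)),
            None => {
                self.schemas.insert(fingerprint, canonical.to_owned());
                Ok(fingerprint)
            }
        }
    }

    /// Looks up the writer schema for a fingerprint read from a message header.
    fn lookup(&self, fingerprint: u64) -> Option<&str> {
        self.schemas.get(&fingerprint).map(String::as_str)
    }
}
```

A reader would extract the fingerprint from each incoming message, call `lookup`, and treat `None` as an unknown-writer-schema error.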
   
   **NOTE:** Integration with `Decoder` and `Reader` will come in the next PR.
   
   # What changes are included in this PR?
   
   | Area                | Highlights |
   | ------------------- | ---------- |
   | **`schema.rs`**     | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical‑form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
   | **`lib.rs`**        | `mod schema` is now `pub`. |
   | **Unit tests**      | New tests covering fingerprint generation, store registration/lookup, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
   | **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |
   
   
   # Are these changes tested?
   
   Yes.  New tests cover:
   
   1. **Fingerprinting** against the canonical examples from the Avro spec.
   2. **`SchemaStore` behavior**: deduplication, duplicate registration, and lookup.
   
   # Are there any user-facing changes?
   
   N/A
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

