jecsand838 commented on code in PR #8006:
URL: https://github.com/apache/arrow-rs/pull/8006#discussion_r2248715880


##########
arrow-avro/src/schema.rs:
##########
@@ -15,12 +15,27 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use arrow_schema::ArrowError;
 use serde::{Deserialize, Serialize};
+use serde_json::{json, Value};
+use std::collections::hash_map::Entry;
 use std::collections::HashMap;
+use strum_macros::AsRefStr;
 
 /// The metadata key used for storing the JSON encoded [`Schema`]
 pub const SCHEMA_METADATA_KEY: &str = "avro.schema";
 
+/// The Avro single‑object encoding “magic” bytes (`0xC3 0x01`)
+pub const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

Review Comment:
   @scovich From the research I was doing it seems what prevents false 
positives is having the magic bytes **located at the start** of the message.  
   
   Here's a twist to consider though, the `Decoder::decoder` method was 
developed to be generic enough to handle both:
   1. OCF block encodings (blocks of Avro records without fingerprints, writer 
schema is determined from OCF header which has it's own 4 magic byte sequence)
   2. Message single object encodings (an encoding of a single Avro record 
starting with the 2 byte magic + 8 byte fingerprint OR 1 byte magic + 4 byte id 
fingerprint if Confluent variant).
   
   I'm beginning to wonder if we'd benefit from having two decoder methods. 
   
   For instance both OCF and the message format support schema evolution via 
writer and reader schemas. In message format the writer schema is determined 
via a fingerprint located on **each Avro message** that may or may not change 
between messages. Meanwhile in OCF format the writer schema is provided in the 
header and will remain static for the full file which in turn is a sequence of 
block encodings comprised of individual Avro records.
   
   My original plan was to use the `Reader` for OCF and the stand-alone 
`Decoder` for Messages (while being re-used in the `Reader`). I'm beginning to 
question that approach.
   
   A second alternative is to define a `DecodeMode` enum for the `Decoder` to 
used for identifying which decode flow is required.
   
   Either way I need to make some modifications based on this new research. 
I'll have something pushed up by this evening. Ty for bringing this up!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to