[I] schema parsing/resolution improvements? [avro-rs]

via GitHub Thu, 04 Dec 2025 11:53:34 -0800


KitFieldhouse opened a new issue, #353:
URL: https://github.com/apache/avro-rs/issues/353


   Hi all, I have been digging into how schema parsing/resolution is done by 
this implementation and comparing its features and functionality compared to 
what I am familiar with Java. I wanted to collect some thoughts and see what 
the community thinks and, if there is interest, start towards getting it 
implemented.
   
   ## Observed Behavior
   
   First some bugs/unexpected behavior.
   
   ### Schema parse ordering
   With the current implementation of the parser, when parsing a list of 
schemata, the order in which the schemata are parsed matters, however, there is 
no guarantee that the schemata are parsed in input order. 
   
   For instance, we can only reference a named schema type if it has already 
been parsed by the parser (with an exception being if the named schema happens 
to be the root schema of one of the provided schemata). However since there is 
no guarantee that the provided schemata are actually parsed in input order, 
sometimes parsing will succeed and sometimes it will fail. This can be tested 
by running this several times in a row. 
   
   ```rust
       #[test]
       fn test_ordering() -> TestResult {
           let schema1 = r#"
              {
               "type": "record", 
               "name": "record_1",
               "fields": [
                   {
                       "name": "a_field",
                       "type": {
                           "type": "record",
                           "name": "user",
                           "fields": [
                                   {
                                       "name": "age",
                                       "type": "float"
                                   }
                           ]
                       } 
                    }
                 ]
               }
               "#;
   
           let schema2 =  r#"{
               "type": "record",
               "name": "a_user",
               "fields": [{"name": "the_user", "type": "user"}]
           }"#;
   
   
           Schema::parse_list(vec![&schema1, &schema2])?;
           Ok(())
       }
   
   ```
   
   Further, even if we parse in input order, there is no forward non-root level 
resolution of named schemata. For instance, I do not think there is any order 
in which to supply this pair of schemata to `parse_list` that will succeed.
   
   ```rust
           let schema1 = r#"
               {
                   "name": "holds_foo_record",
                   "type": "record",
                   "fields": [{"name": "holds_foo", "type": 
                       {
                           "name": "foo_record",
                           "type": "record",
                           "fields": [{"name": "foo", "type": "baz_record"}]
                       }
                   }]
               }
               "#;    
   
           let schema2 = r#"
               {
                   "name": "holds_baz_record",
                   "type": "record",
                   "fields": [{"name": "holds_baz", "type": 
                       {
                           "name": "baz_record",
                           "type": "record",
                           "fields": [{"name": "baz", "type": "foo_record"}]
                       }}]
               }
               "#;    
   ```
   Maybe it is possible to parse and resolve this type of schemata by using 
`parse_with_names`? 
   
   The Avro code generator for Java parses and handles this schema with no 
issue.
   
   It occurs to me that, as far as I can tell, the Avro specification doesn't 
make any specific guarantees about how parsing and resolution should be carried 
out when we provide a list of schemata. The spec only seems to concern itself 
with name resolution within the context of a single schema 
(https://avro.apache.org/docs/++version++/specification/#names).
   
   With that being said, I think it would make sense to give lots of 
flexibility in using names in one schema that may have a definition nested in 
another.
   
   
   ### Parser “resolution” vs `ResolvedSchema` resolution
   The documentation mentions that `Schema::parse_list` “resolves” 
cross-dependencies when given a list of schemas. From experimentation, it seems 
this means:
   
   “We verify that there exists some schema with the given name somewhere in 
the input set (or as a root),”
   
   rather than:
   
   “We permanently bind each named reference to a specific Schema definition.”
   
   For instance, this test fails:
   
   ```rust
   #[test]
   fn test_resolution() -> TestResult {
      let schema1 = r#"{
           "type": "record",
           "name": "schema_a",
           "fields": [{"name": "a_field", "type": "long"}]
      }"#;
   
      let schema2 = r#"{
           "type": "schema_a",
           "name": "a_test"
      }"#;
   
      let schema3 = r#"{
           "type": "record",
           "name": "schema_a",
           "fields": [{"name": "a_field", "type": "float"}]
      }"#;
   
      let schema12 = Schema::parse_list(vec![&schema1, &schema2]).unwrap();
      let schema3 = Schema::parse_str(&schema3).unwrap();
   
      assert_eq!(
          "schema_a",
          &schema12[1].name().unwrap().fullname(None)
      );
   
      let resolved = ResolvedSchema::try_from(vec![&schema3, 
&schema12[1]]).unwrap();
   
      // Intuitively, I would expect `schema12[1]` to still refer to the same
      // `schema_a` definition as it did during parsing (`schema12[0]`).
      let schema2_resolved = 
resolved.names_ref.get(&schema12[0].name().unwrap()).unwrap();
      assert_eq!(**schema2_resolved, schema12[0]); // currently fails
   
      Ok(())
   }
   ```
   The parser is happy with `schema1` + `schema2` as long as there exists some 
`schema_a`.
   
   Later, `ResolvedSchema::try_from` gets both `schema3` and `schema12[1]` and 
builds its own name map which disagrees with the parser’s earlier “resolution” 
of `schema_a`.
   
   I don’t think this is necessarily a bug per se, but it means that what the 
parser calls “resolution” is different from what `ResolvedSchema` does, and the 
two are not mutually consistent.
   
   ### Aliasing: parser vs ResolvedSchema
   
   In addition, in the current implementation, what the parser will "resolve" 
compared to what `ResolvedSchema` attempts to resolve has slightly different 
behavior. Indeed, this can be observed to be the case with this test, which, 
under the current implementation, fails.
   ```rust
       #[test] 
       fn test_aliasing() -> TestResult {
          let schema1 = r#"{
               "type": "record",
               "name": "a_record_schema",
               "aliases": ["an_alias"],
               "fields": [{"name": "a_field", "type": "long"}]
          }"#;
   
          let schema2 = r#"{
               "type": "an_alias",
               "name": "a_test"
          }"#;
   
          let schema = Schema::parse_list(vec![&schema1, &schema2]).unwrap();
          let schemata : Vec<&Schema> = schema.iter().collect();
   
          let resolved = ResolvedSchema::try_from(schemata).unwrap();
   
          return Ok(())
           
       }
   ```
   
   This is because, while the parsing resolution algorithm allows for aliasing, 
`ResolvedSchema` does not take this into account.
   
   
   ### Using names equal to a complex type name.
   Another place where the Rust implementation differs from Java is when 
defining a named schema whose name is one of the complex type names ("record", 
"enum", "fixed").
   
   From playing with the code, I believe the Rust implementation behaves 
roughly like:
   
   - In field type positions (inside "fields"), `"type": "record"` refers to a 
previously defined named schema "record" if one exists.
   
   - In JSON schema objects, `"type": "record"` is first interpreted as the 
built-in complex type and only falls back to a previously-defined schema named 
"record" if the object doesn’t look like a record (e.g., it lacks "fields").
   
   This means the following schema parses in Rust. Java, on the other hand, 
rejects this schema:
   ```rust
   #[test]
   fn test_using_name_record() -> TestResult {
       let schema = r#"[
           {
               "type": "record",
               "name": "record",
               "fields": [
                   {"name": "foo", "type": "long"}
               ]
           },
           {
               "type": "record",
               "name": "user"
           }
       ]"#;
   
       Schema::parse_str(schema).map_or(Ok(()), |_| {
           panic!("Java does not parse this!")
       })
   }
   ```
   
   Intuitively, it seems more predictable if `"type": "record"` in a schema 
object always meant the built-in complex type "record" (as in Java), instead of 
depending on whether we can reinterpret it as a reference to a named type.
   
   
   ## Possible improvements? 
   
   I think the root of the problem comes from the need to cleanly separate 
where we perform simple "dumb" schema parsing, and where we perform schema name 
resolution. Right now, part of this is handled by `Schema::parse` and other 
such associated functions, and some of it is handled by `ResolvedSchema`, but 
neither of them constitute a full resolution step and aren't mutually 
consistent.
   
   I think making this clean separation would both simplify the existing API as 
well as make the implementation easier to reason through. 
   
   In broad strokes, the idea is to have `Parser` be rather simple in its 
checks -- only checking situations that will always result in an error like 
redefining the same schema name twice. Then, we form a resolution context where 
it is required that each  referenced name has a single unique implementation 
that it can resolve to. Further, this context could be made to take a custom 
resolver function to use when it can't find a schema name internally. Something 
like this could be extended to build out plugins for interacting with schema 
registries. 
   
   What do you all think? I have started down the route of getting some of this 
implemented, but would greatly appreciate any thoughts, especially if I am 
missing something in my analysis!
   
   Go Avro!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] schema parsing/resolution improvements? [avro-rs]

Reply via email to