KitFieldhouse opened a new issue, #353:
URL: https://github.com/apache/avro-rs/issues/353
Hi all, I have been digging into how schema parsing/resolution is done by
this implementation and comparing its features and functionality compared to
what I am familiar with Java. I wanted to collect some thoughts and see what
the community thinks and, if there is interest, start towards getting it
implemented.
## Observed Behavior
First some bugs/unexpected behavior.
### Schema parse ordering
With the current implementation of the parser, when parsing a list of
schemata, the order in which the schemata are parsed matters, however, there is
no guarantee that the schemata are parsed in input order.
For instance, we can only reference a named schema type if it has already
been parsed by the parser (with an exception being if the named schema happens
to be the root schema of one of the provided schemata). However since there is
no guarantee that the provided schemata are actually parsed in input order,
sometimes parsing will succeed and sometimes it will fail. This can be tested
by running this several times in a row.
```rust
#[test]
fn test_ordering() -> TestResult {
let schema1 = r#"
{
"type": "record",
"name": "record_1",
"fields": [
{
"name": "a_field",
"type": {
"type": "record",
"name": "user",
"fields": [
{
"name": "age",
"type": "float"
}
]
}
}
]
}
"#;
let schema2 = r#"{
"type": "record",
"name": "a_user",
"fields": [{"name": "the_user", "type": "user"}]
}"#;
Schema::parse_list(vec![&schema1, &schema2])?;
Ok(())
}
```
Further, even if we parse in input order, there is no forward non-root level
resolution of named schemata. For instance, I do not think there is any order
in which to supply this pair of schemata to `parse_list` that will succeed.
```rust
let schema1 = r#"
{
"name": "holds_foo_record",
"type": "record",
"fields": [{"name": "holds_foo", "type":
{
"name": "foo_record",
"type": "record",
"fields": [{"name": "foo", "type": "baz_record"}]
}
}]
}
"#;
let schema2 = r#"
{
"name": "holds_baz_record",
"type": "record",
"fields": [{"name": "holds_baz", "type":
{
"name": "baz_record",
"type": "record",
"fields": [{"name": "baz", "type": "foo_record"}]
}}]
}
"#;
```
Maybe it is possible to parse and resolve this type of schemata by using
`parse_with_names`?
The Avro code generator for Java parses and handles this schema with no
issue.
It occurs to me that, as far as I can tell, the Avro specification doesn't
make any specific guarantees about how parsing and resolution should be carried
out when we provide a list of schemata. The spec only seems to concern itself
with name resolution within the context of a single schema
(https://avro.apache.org/docs/++version++/specification/#names).
With that being said, I think it would make sense to give lots of
flexibility in using names in one schema that may have a definition nested in
another.
### Parser “resolution” vs `ResolvedSchema` resolution
The documentation mentions that `Schema::parse_list` “resolves”
cross-dependencies when given a list of schemas. From experimentation, it seems
this means:
“We verify that there exists some schema with the given name somewhere in
the input set (or as a root),”
rather than:
“We permanently bind each named reference to a specific Schema definition.”
For instance, this test fails:
```rust
#[test]
fn test_resolution() -> TestResult {
let schema1 = r#"{
"type": "record",
"name": "schema_a",
"fields": [{"name": "a_field", "type": "long"}]
}"#;
let schema2 = r#"{
"type": "schema_a",
"name": "a_test"
}"#;
let schema3 = r#"{
"type": "record",
"name": "schema_a",
"fields": [{"name": "a_field", "type": "float"}]
}"#;
let schema12 = Schema::parse_list(vec![&schema1, &schema2]).unwrap();
let schema3 = Schema::parse_str(&schema3).unwrap();
assert_eq!(
"schema_a",
&schema12[1].name().unwrap().fullname(None)
);
let resolved = ResolvedSchema::try_from(vec![&schema3,
&schema12[1]]).unwrap();
// Intuitively, I would expect `schema12[1]` to still refer to the same
// `schema_a` definition as it did during parsing (`schema12[0]`).
let schema2_resolved =
resolved.names_ref.get(&schema12[0].name().unwrap()).unwrap();
assert_eq!(**schema2_resolved, schema12[0]); // currently fails
Ok(())
}
```
The parser is happy with `schema1` + `schema2` as long as there exists some
`schema_a`.
Later, `ResolvedSchema::try_from` gets both `schema3` and `schema12[1]` and
builds its own name map which disagrees with the parser’s earlier “resolution”
of `schema_a`.
I don’t think this is necessarily a bug per se, but it means that what the
parser calls “resolution” is different from what `ResolvedSchema` does, and the
two are not mutually consistent.
### Aliasing: parser vs ResolvedSchema
In addition, in the current implementation, what the parser will "resolve"
compared to what `ResolvedSchema` attempts to resolve has slightly different
behavior. Indeed, this can be observed to be the case with this test, which,
under the current implementation, fails.
```rust
#[test]
fn test_aliasing() -> TestResult {
let schema1 = r#"{
"type": "record",
"name": "a_record_schema",
"aliases": ["an_alias"],
"fields": [{"name": "a_field", "type": "long"}]
}"#;
let schema2 = r#"{
"type": "an_alias",
"name": "a_test"
}"#;
let schema = Schema::parse_list(vec![&schema1, &schema2]).unwrap();
let schemata : Vec<&Schema> = schema.iter().collect();
let resolved = ResolvedSchema::try_from(schemata).unwrap();
return Ok(())
}
```
This is because, while the parsing resolution algorithm allows for aliasing,
`ResolvedSchema` does not take this into account.
### Using names equal to a complex type name.
Another place where the Rust implementation differs from Java is when
defining a named schema whose name is one of the complex type names ("record",
"enum", "fixed").
From playing with the code, I believe the Rust implementation behaves
roughly like:
- In field type positions (inside "fields"), `"type": "record"` refers to a
previously defined named schema "record" if one exists.
- In JSON schema objects, `"type": "record"` is first interpreted as the
built-in complex type and only falls back to a previously-defined schema named
"record" if the object doesn’t look like a record (e.g., it lacks "fields").
This means the following schema parses in Rust. Java, on the other hand,
rejects this schema:
```rust
#[test]
fn test_using_name_record() -> TestResult {
let schema = r#"[
{
"type": "record",
"name": "record",
"fields": [
{"name": "foo", "type": "long"}
]
},
{
"type": "record",
"name": "user"
}
]"#;
Schema::parse_str(schema).map_or(Ok(()), |_| {
panic!("Java does not parse this!")
})
}
```
Intuitively, it seems more predictable if `"type": "record"` in a schema
object always meant the built-in complex type "record" (as in Java), instead of
depending on whether we can reinterpret it as a reference to a named type.
## Possible improvements?
I think the root of the problem comes from the need to cleanly separate
where we perform simple "dumb" schema parsing, and where we perform schema name
resolution. Right now, part of this is handled by `Schema::parse` and other
such associated functions, and some of it is handled by `ResolvedSchema`, but
neither of them constitute a full resolution step and aren't mutually
consistent.
I think making this clean separation would both simplify the existing API as
well as make the implementation easier to reason through.
In broad strokes, the idea is to have `Parser` be rather simple in its
checks -- only checking situations that will always result in an error like
redefining the same schema name twice. Then, we form a resolution context where
it is required that each referenced name has a single unique implementation
that it can resolve to. Further, this context could be made to take a custom
resolver function to use when it can't find a schema name internally. Something
like this could be extended to build out plugins for interacting with schema
registries.
What do you all think? I have started down the route of getting some of this
implemented, but would greatly appreciate any thoughts, especially if I am
missing something in my analysis!
Go Avro!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]