It is possible to construct an ambiguous schema using the latest Avro
specification. Before I file a JIRA issue, I want to check whether this
is a known deficiency. I believe this is a bug in the specification, not
any particular implementation.

Types can be defined in the null namespace, and then those types can be
referenced later. Such a reference would not contain any dots. For
example, if we define the type "Target" in the null namespace, we can
refer to it with the fullname "Target". However, the specification says
that when a reference has no dot, "the namespace is the namespace of the
enclosing definition." That means we could define a different type
"Target" in the namespace "org.apache.avro". It could be referenced with
the fullname "org.apache.avro.Target". If the enclosing namespace is
already "org.apache.avro", then it could also be referenced with the
simple name "Target". The problem arises when a single schema includes
both those types, and "Target" is a valid reference to either one.

In short, it is impossible to distinguish a qualified name that happens
to be in the null namespace from a simple name. The specification
creates this problem by neglecting the null namespace when it defines a
fullname as "composed of two parts: a name and a namespace, separated by
a dot."
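
To make the collision concrete, here is a minimal sketch of the reference
rule as I read it. It is plain Java with no Avro dependency, and the class
name SpecNameResolution and the method resolve are hypothetical; they are
not part of any Avro API.

```java
final class SpecNameResolution {
    // Hypothetical helper, not an Avro API: a literal reading of the rule
    // quoted above. A dotted reference is already a fullname; an undotted
    // reference takes the namespace of the enclosing definition.
    // A null or empty enclosingNamespace stands for the null namespace.
    static String resolve(String reference, String enclosingNamespace) {
        if (reference.contains(".")) {
            return reference;
        }
        if (enclosingNamespace == null || enclosingNamespace.isEmpty()) {
            return reference;
        }
        return enclosingNamespace + "." + reference;
    }

    public static void main(String[] args) {
        // At the top level, "Target" names the null-namespace type.
        System.out.println(resolve("Target", null));
        // Inside org.apache.avro, the same text names a different type.
        System.out.println(resolve("Target", "org.apache.avro"));
    }
}
```

Under this reading, no input to resolve yields the null-namespace "Target"
once the enclosing namespace is "org.apache.avro", because the only
spelling of that fullname has no dot.
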
This could be solved by resolving every ambiguity in favor of the
null-namespace reference. For example, the reference "Target" should be
interpreted as a fullname in the null namespace if such a type exists,
and as a simple name in the enclosing namespace otherwise. An author who
does not intend to reference the null namespace can always use an
unambiguous dotted fullname instead.
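
The resolution proposed above could look roughly like the following sketch.
It is plain Java with no Avro dependency; the class NullNamespaceFirst, the
method resolve, and the definedFullnames parameter are all hypothetical
names chosen for illustration, and this is not how any existing
implementation behaves.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class NullNamespaceFirst {
    // Hypothetical helper, not an Avro API. definedFullnames holds the
    // fullname of every type defined so far; a type in the null namespace
    // appears there without a dot.
    static String resolve(String reference, String enclosingNamespace,
                          Set<String> definedFullnames) {
        if (reference.contains(".")) {
            return reference; // already an unambiguous fullname
        }
        boolean inNullNamespace =
            enclosingNamespace == null || enclosingNamespace.isEmpty();
        if (inNullNamespace || definedFullnames.contains(reference)) {
            return reference; // the null-namespace type wins
        }
        return enclosingNamespace + "." + reference;
    }

    public static void main(String[] args) {
        Set<String> both = new HashSet<>(
            Arrays.asList("Target", "org.apache.avro.Target"));
        Set<String> decoyOnly = new HashSet<>(
            Arrays.asList("org.apache.avro.Target"));
        // Both types exist: the undotted reference goes to the null namespace.
        System.out.println(resolve("Target", "org.apache.avro", both));
        // Only the namespaced type exists: the simple name still works.
        System.out.println(resolve("Target", "org.apache.avro", decoyOnly));
        // The namespaced type stays reachable through its dotted fullname.
        System.out.println(
            resolve("org.apache.avro.Target", "org.apache.avro", both));
    }
}
```

The cost of this rule is that defining a null-namespace type changes the
meaning of existing undotted references with the same name, which is the
compatibility concern mentioned in the next paragraph.
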
Any solution will create compatibility concerns, so first I just want to
discuss whether this is believed to be a problem.

The following complete test case illustrates how this issue leads to
data corruption with the Java API. Note that the Java implementation
neither detects the ambiguity nor resolves it the way I am recommending.

    @Test
    void testAmbiguousReference() {
        // "Target" defined in the null namespace.
        final Schema target = SchemaBuilder.builder()
            .record("Target")
            .doc("right")
            .fields()
            .endRecord();
        // A different type with the same simple name, in org.apache.avro.
        final Schema decoy = SchemaBuilder.builder()
            .record(target.getName())
            .namespace("org.apache.avro")
            .doc("wrong")
            .fields()
            .endRecord();
        final Schema ambiguous = SchemaBuilder.builder()
            .record("Ambiguous")
            .fields()
            .name("definition")
            .type(target)
            .noDefault()
            .name("working")
            .type(target)
            .noDefault()
            .name("enclosing")
            .type(SchemaBuilder.builder()
                .record("Enclosing")
                .namespace("org.apache.avro")
                .fields()
                .name("decoy")
                .type(decoy)
                .noDefault()
                .name("working")
                .type(decoy)
                .noDefault()
                // References the null-namespace Target from inside
                // org.apache.avro.
                .name("broken")
                .type(target)
                .noDefault()
                .endRecord())
            .noDefault()
            .endRecord();
        // Round-trip the schema through its JSON form.
        final Schema parsed = new Schema.Parser().parse(
            ambiguous.toString());
        // This assertion succeeds.
        Assertions.assertEquals(
            ambiguous.getField("working").schema(),
            parsed.getField("working").schema());
        // This assertion succeeds but the specification is unclear.
        Assertions.assertEquals(
            ambiguous.getField("enclosing").schema()
                .getField("working").schema(),
            parsed.getField("enclosing").schema()
                .getField("working").schema());
        // This assertion FAILS.
        Assertions.assertEquals(
            ambiguous.getField("enclosing").schema()
                .getField("broken").schema(),
            parsed.getField("enclosing").schema()
                .getField("broken").schema());
    }

The assertion failure message complains:

    expected: <{"type":"record","name":"Target","doc":"right","fields":[]}>
    but was:
    <{"type":"record","name":"Target","namespace":"org.apache.avro","doc":"wrong","fields":[]}>