Bridger Howell created AVRO-2159:
------------------------------------
Summary: Naming Limitations of Schemas in Stricter Reference
Contexts
Key: AVRO-2159
URL: https://issues.apache.org/jira/browse/AVRO-2159
Project: Avro
Issue Type: New Feature
Reporter: Bridger Howell
(Excuse the lengthiness of this ticket description - it was initially written
as an email that became too long. Feel free to correct any misguided reasoning.)
I've come to realize that there are some undesirable constraints on how Avro
schemas can be used in Java code generation and IDL, constraints that appear
only as minor annoyances when you use schemas generically. In particular, I'm
focused on cases where it's desirable to use two schemas that have the same
name within a single context.
*Issue:*
Suppose I'm writing an application that publishes many different kinds of
data somewhere, with each type of data having its own schema. And then suppose
that some number of those schemas would like to share some kind of common
schema, to start with.
If I do this, and I happen to be using Java code generation to manage schemas,
I'll soon find difficulty in two directions:
- I would find it difficult to upgrade the data shared among all of these
external schemas by way of the common schema, without upgrading all of those
schemas at the same time. The problem is that avro's name field maps to a class
name on Java's classpath, or to a reference name among an IDL protocol's
symbols, and neither context can hold two different schemas under one name.
The intermediate step of the application being partially migrated between
version 1 and version 2 of a common schema has no representation in either of
these contexts. Using a different name becomes a very annoying option in many
cases, since it is an incompatible change (or with aliases, it's at least not
consistently compatible across implementations).
- I would find it difficult to migrate the external schemas away from using
that shared schema, for the same reasons listed above.
In IDL (without code generation), these issues can usually be avoided by
creating a second protocol, and in generic avro, the issues would be avoided by
using a different schema parser or schema builder.
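To make the problem concrete, here is a rough sketch in Python. It is a toy
model, not the Avro library: {{ReferenceContext}}, {{define}} and
{{try_define}} are hypothetical names standing in for a parser, an IDL
protocol, or a classpath. It assumes a context maps each name to exactly one
schema, so both versions of a common schema cannot live in one context, while
a second context accepts the new version without complaint.

```python
class ReferenceContext:
    """Toy stand-in for a schema parser, an IDL protocol, or a classpath."""

    def __init__(self):
        self._schemas = {}

    def define(self, schema):
        # A context can hold only one schema per name.
        name = schema["name"]
        existing = self._schemas.get(name)
        if existing is not None and existing != schema:
            raise ValueError(f"name already defined in this context: {name}")
        self._schemas[name] = schema
        return schema

    def lookup(self, name):
        return self._schemas[name]


def try_define(context, schema):
    """Return True if the schema could be registered, False otherwise."""
    try:
        context.define(schema)
        return True
    except ValueError:
        return False


# Two versions of the shared schema; both keep the name "Common".
common_v1 = {"type": "record", "name": "Common",
             "fields": [{"name": "id", "type": "long"}]}
common_v2 = {"type": "record", "name": "Common",
             "fields": [{"name": "id", "type": "long"},
                        {"name": "source", "type": "string", "default": ""}]}

shared = ReferenceContext()
first_ok = try_define(shared, common_v1)
second_ok = try_define(shared, common_v2)   # same name, different schema

# The workaround described above: a separate reference context.
separate = ReferenceContext()
separate_ok = try_define(separate, common_v2)
```

In this model the partially migrated state (some users on version 1, others on
version 2) only exists if the application keeps two contexts around.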
*Analysis:*
At first glance, it is tempting to blame the name-matching requirement for
schema resolution as the culprit - and it may be correct in many cases that
requiring that schemas have compatible structure is all that is needed.
However, the way I see it is that the name-matching requirement for schema
resolution is there to ensure that there is _the intent for two schemas to
resolve with each other_, and the rest of the checks are just there to make
sure that such an intent can be reasonably carried out.
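As a sketch of that reading, the check below is split into an "intent" step
and a "structure" step. It is deliberately simplified and is my own
illustration rather than how any Avro implementation does it: it handles only
flat records, and ignores aliases, namespaces, type promotion and unions.
{{can_resolve}} is a hypothetical helper name.

```python
def can_resolve(writer, reader):
    """Simplified record resolution check: intent first, then structure."""
    # Intent: both sides must be records that carry the same name.
    if writer["type"] != "record" or reader["type"] != "record":
        return False
    if writer["name"] != reader["name"]:
        return False

    # Structure: every reader field must either be supplied by the writer
    # with the same type, or have a default value to fall back on.
    writer_fields = {f["name"]: f for f in writer["fields"]}
    for field in reader["fields"]:
        match = writer_fields.get(field["name"])
        if match is None:
            if "default" not in field:
                return False
        elif match["type"] != field["type"]:
            return False
    return True


user_v1 = {"type": "record", "name": "User",
           "fields": [{"name": "id", "type": "long"}]}
user_v2 = {"type": "record", "name": "User",
           "fields": [{"name": "id", "type": "long"},
                      {"name": "email", "type": "string", "default": ""}]}
# Structurally identical to user_v1, but never meant to resolve with it.
account = {"type": "record", "name": "Account",
           "fields": [{"name": "id", "type": "long"}]}

same_intent = can_resolve(user_v1, user_v2)
no_intent = can_resolve(user_v1, account)
```

Here {{user_v1}} and {{account}} have the same structure, but the name check
rejects them, which is the "intent" part of the rule.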
The difficulty in either of the two examples above happens not because of a
lack of pre-determined intent for schemas to resolve, but rather because of
the inability to simultaneously supply a unique reference for each of the
schemas, while intending that the correct groups of schemas can resolve.
Thus, the way to avoid these issues so far has been to create a new reference
context, and the severity of the issue in each case corresponds to the
difficulty of creating a new reference context:
* For generic schemas, create a new parser or schema builder [easy - minorly
annoying]
* For IDL, create a new protocol [minorly annoying - somewhat annoying]
* For Java code generation, create a new classpath [very annoying (Java 9) -
impossible]
Based on that, I understand a schema's name as expressing two overlapping
meanings:
- the intent to be able to resolve with other schemas with the same name
(let's call this the {{resolveName}})
- the ability to be uniquely referenced from some context (let's call this the
{{referenceName}})
If these two meanings were able to be specified independently, I think that
schemas would be much easier to use in contexts where references are more
limited.
*Speculative Solutions:*
Minimally, I think it's reasonable to create at least one new field to
separate the meaning of a schema's {{referenceName}} from its {{resolveName}},
and use the old name field to compatibly handle missing values. Then other
tools that don't immediately apply schema resolution can optionally upgrade to
support using the {{referenceName}} instead of the {{resolveName}}.
Beyond that, having {{name}} continue to mean {{resolveName}} would mean that
old avro implementations would be able to treat newer schemas as valid and
resolve against them correctly. So I think it's reasonable to say that
{{referenceName}} should be the new field introduced (not necessarily with that
name).
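A minimal sketch of the fallback described above, again as a toy model in
Python rather than real Avro code. The {{referenceName}} key is the field
proposed in this ticket and does not exist in Avro today; the helper names
({{resolve_name}}, {{reference_name}}, {{build_symbol_table}}) are
hypothetical.

```python
def resolve_name(schema):
    """The existing name field keeps its meaning for schema resolution."""
    return schema["name"]


def reference_name(schema):
    """Use the proposed referenceName if present, else fall back to name."""
    return schema.get("referenceName", schema["name"])


def build_symbol_table(schemas):
    """Index schemas by reference name, as a parser or protocol might."""
    table = {}
    for schema in schemas:
        key = reference_name(schema)
        if key in table:
            raise ValueError(f"duplicate reference name: {key}")
        table[key] = schema
    return table


# An old-style schema (no referenceName) next to a new version that keeps
# the same name for resolution but is referenced as "UserV2".
user_v1 = {"type": "record", "name": "User",
           "fields": [{"name": "id", "type": "long"}]}
user_v2 = {"type": "record", "name": "User", "referenceName": "UserV2",
           "fields": [{"name": "id", "type": "long"},
                      {"name": "email", "type": "string", "default": ""}]}

table = build_symbol_table([user_v1, user_v2])
```

Both versions now fit in one context, while {{resolve_name}} still returns
{{User}} for each, so a tool that only looks at {{name}} would treat them as
intended to resolve with each other.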
Assuming I've made no mistakes up until this point, there are some remaining
questions:
1. How should this appear in IDL? There are two solutions that come to mind,
using the existing intuition of how annotations work:
a. Declared as: {{@ref("UserV2") record User {}}}
Used as: {{@ref("UserV2") User user;}}
- Can be quite verbose
- The annotated type ({{User}}) only really exists as a placeholder
- Requires a new error message for when a type is used but needs a
reference name.
b. Declared as: {{@name("User") record UserV2 {}}}
Used as: {{UserV2 user;}}
- Relies on a weird intuition of how a "type name" maps to a raw schema.
Normally, the type name becomes the schema's name field, but when the {{name}}
field is specified by annotation, the type name just becomes the
{{referenceName}} field.
- Requires more work to change the implementation of type annotations,
since {{name}} is normally a reserved field.
- Matches the current intuition that two schemas resolve only if their
{{name}} fields match, and that type names are always used for references.
2. How should the {{namespace}} field be handled?
Should both names share the {{namespace}} field and act like the name field
does now (where any name that contains a dot is assumed to specify the full
name)? Or should reference names ignore the {{namespace}} field (and maybe have
one of their own)?
3. If avro parsers are changed to use {{referenceName}} instead of
{{name}}, when present, how much concern is there about old parsers not being
able to parse the new {{referenceName}} field if it is used sparingly (only
generated when necessary)?
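For question 2, the two namespace options could be sketched like this. The
{{full_name}} helper follows the current rule for names (a name containing a
dot is taken as the full name); {{reference_full_name}} and its
{{share_namespace}} flag are hypothetical and exist only to contrast the two
options. Neither is a proposal for the actual API.

```python
def full_name(name, namespace=None):
    """Current rule: a name containing a dot is already a full name."""
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def reference_full_name(schema, share_namespace):
    """Compute the full reference name under either option.

    share_namespace=True:  referenceName is qualified like name is today.
    share_namespace=False: referenceName ignores the namespace field.
    """
    if "referenceName" not in schema:
        # No referenceName: behave exactly as the name field does now.
        return full_name(schema["name"], schema.get("namespace"))
    if share_namespace:
        return full_name(schema["referenceName"], schema.get("namespace"))
    return schema["referenceName"]


schema = {"type": "record", "name": "User", "namespace": "com.example",
          "referenceName": "UserV2", "fields": []}

shared_ns = reference_full_name(schema, share_namespace=True)
ignored_ns = reference_full_name(schema, share_namespace=False)
```

With a shared namespace the reference becomes {{com.example.UserV2}}; when the
namespace is ignored it stays {{UserV2}}, which would then need some other way
to avoid collisions across namespaces.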