[
https://issues.apache.org/jira/browse/AVRO-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thiruvalluvan M. G. updated AVRO-2159:
--------------------------------------
Component/s: spec
> Naming Limitations of Schemas in Stricter Reference Contexts
> ------------------------------------------------------------
>
> Key: AVRO-2159
> URL: https://issues.apache.org/jira/browse/AVRO-2159
> Project: Apache Avro
> Issue Type: New Feature
> Components: spec
> Reporter: Bridger Howell
> Priority: Major
>
> (Excuse the lengthiness of this ticket description - it was initially written
> as an email that became too long. Feel free to correct any misguided
> reasoning.)
> I've come to realize that there are some undesirable constraints on how avro
> schemas can be used in Java code generation and IDL, that only appear as
> minor annoyances when you use schemas generically. In particular, I'm focused
> on cases where it's desirable to use two schemas that have the same name in
> some context.
>
> *Issue:*
> Suppose I'm writing an application that publishes many different kinds of
> data somewhere, with each type of data having its own schema. And then
> suppose that some number of those schemas would like to share some kind of
> common schema, to start with.
>
> If I do this, and I happen to be using Java code generation to manage
> schemas, I'll soon find difficulty in two directions:
>
> - I would find it difficult to upgrade the data shared among all of these
> external schemas by way of the common schema, without upgrading all of those
> schemas at the same time. The problem is that avro's name field maps to a
> class name on the classpath or to a reference name among a protocol's
> symbols, and neither Java's classpath nor an IDL protocol can hold two
> definitions under the same name.
>
> The intermediate step of the application being partially migrated between
> version 1 and version 2 of a common schema has no representation in either of
> these contexts. Using a different name becomes a very annoying option in many
> cases, since it is an incompatible change (or with aliases, it's at least not
> consistently compatible across implementations).
> - I would find it difficult to migrate away from the external schemas using
> that shared schema, for the same reasons listed above.
>
> In IDL (without code generation), these issues can usually be avoided by
> creating a second protocol, and in generic avro, the issues would be avoided
> by using a different schema parser or schema builder.
>
> *Analysis:*
> At first glance, it is tempting to blame the name-matching requirement for
> schema resolution as a culprit - and it may be correct in many cases that
> requiring schemas have compatible structure is all that is needed.
>
> However, the way I see it is that the name-matching requirement for schema
> resolution is there to ensure that there is _the intent for two schemas to
> resolve with each other_, and the rest of the checks are just there to make
> sure that such an intent can be reasonably carried out.
>
> The difficulty from either of the two examples above happens not because of a
> lack of pre-determined intent for schemas to resolve, but rather the
> inability to simultaneously supply a unique reference for each of the
> schemas, while intending that the correct groups of schemas can resolve.
>
> Thus, the way to avoid these issues so far has been to create a new
> reference context, and the severity of the issue in each case corresponds to
> the difficulty of creating a new reference context:
> * For generic schemas, create a new parser or schema builder [easy - minorly
> annoying]
> * For IDL, create a new protocol [minorly annoying - somewhat annoying]
> * For Java code generation, create a new classpath [very annoying (Java 9) -
> impossible]
>
> Based on that, I understand a schema's name as expressing two overlapping
> meanings:
> - the intent to be able to resolve with other schemas with the same name
> (let's call this the {{resolveName}})
> - the ability to be uniquely referenced from some context (let's call this
> the {{referenceName}})
>
> If these two meanings were able to be specified independently, I think that
> schemas would be much easier to use in contexts where references are more
> limited.
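
The split described above can be sketched in a few lines of Python. This is only a toy model, not Avro's API: `NamedSchema`, `ReferenceContext` and `names_allow_resolution` are hypothetical names made up for illustration, and the "context" stands in loosely for a parser, an IDL protocol or a classpath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedSchema:
    # Both attributes are hypothetical labels borrowed from the ticket text.
    resolve_name: str    # intent: which schemas this one may resolve against
    reference_name: str  # unique handle inside one reference context


class ReferenceContext:
    """Toy stand-in for a parser, an IDL protocol, or a classpath."""

    def __init__(self):
        self._by_reference = {}

    def register(self, schema):
        # A context can hold only one schema per reference name.
        if schema.reference_name in self._by_reference:
            raise ValueError(f"duplicate reference name: {schema.reference_name}")
        self._by_reference[schema.reference_name] = schema

    def lookup(self, reference_name):
        return self._by_reference[reference_name]


def names_allow_resolution(writer, reader):
    # Only the name check; the structural checks of real schema
    # resolution are deliberately left out of this sketch.
    return writer.resolve_name == reader.resolve_name


context = ReferenceContext()
user_v1 = NamedSchema(resolve_name="User", reference_name="UserV1")
user_v2 = NamedSchema(resolve_name="User", reference_name="UserV2")
context.register(user_v1)
context.register(user_v2)
```

With the two names decoupled, both versions can live in the same context under different reference names while still declaring the intent to resolve with each other. When the single `name` field has to serve both purposes, the second `register` call would be rejected.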
>
> *Speculative Solutions:*
> Minimally, I think it's reasonable to create at least one new field to
> separate the meaning of a schema's {{referenceName}} from its
> {{resolveName}}, and use the old name field to compatibly handle missing
> values. Then other tools that don't immediately apply schema resolution can
> optionally upgrade to support using the {{referenceName}} instead of the
> {{resolveName}}.
>
> Beyond that, having {{name}} continue to mean {{resolveName}} would mean
> that old avro implementations would be able to treat newer schemas as valid
> and resolve against them correctly. So I think it's reasonable to say that
> {{referenceName}} should be the new field introduced (not necessarily with
> that name).
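
As a rough illustration of the fallback behaviour proposed above, the sketch below reads a schema as plain JSON and derives both names. The `referenceName` attribute is hypothetical (the ticket itself says the final name is open), and the helper functions are invented for this example rather than taken from any Avro library.

```python
import json

# A record schema carrying the hypothetical "referenceName" attribute.
NEW_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "referenceName": "UserV2",
  "fields": [{"name": "id", "type": "long"}]
}
""")

# The same record as it would be written today, without the new attribute.
OLD_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [{"name": "id", "type": "long"}]
}
""")


def resolve_name(schema):
    # "name" keeps its current meaning for schema resolution.
    return schema["name"]


def reference_name(schema):
    # Assumed rule: use "referenceName" when present, else fall back to "name".
    return schema.get("referenceName", schema["name"])
```

Under this assumed rule, a schema without the new attribute behaves exactly as it does now, and a tool that only looks at `name` still sees `User` in both schemas.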
>
> Assuming I've made no mistakes up until this point, there are some remaining
> questions:
>
> 1. How should this appear in IDL? There are two solutions that come to mind,
> using the existing intuition of how annotations work:
>
> a. Declared as: {{@ref("UserV2") record User {}}}
> Used as: {{@ref("UserV2") User user;}}
>
> - Can be quite verbose
> - The annotated type ({{User}}) only really exists as a placeholder
> - Requires a new error message for when a type is used but needs a
> reference name.
> b. Declared as: {{@name("User") record UserV2 {}}}
> Used as: {{UserV2 user;}}
>
> - Relies on a weird intuition of how a "type name" maps to a raw
> schema. Normally, the type name becomes the schema's name field, but when the
> {{name}} field is specified by annotation, the type name just becomes the
> {{referenceName}} field.
> - Requires more work to change the implementation of type annotations,
> since {{name}} is normally a reserved field.
> - Matches the current intuition that two schemas resolve only if their
> {{name}} fields match, and that type names are always used for references.
>
> 2. How should the {{namespace}} field be handled?
> Should both names share the {{namespace}} field and act like the name field
> does now (where any name that contains a dot is assumed to specify the full
> name)? Or should reference names ignore the {{namespace}} field (and maybe
> have one of their own)?
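
The two options in question 2 can be compared with a small sketch. `full_name` follows the existing rule that a name containing a dot is already the full name, simplified to ignore namespaces inherited from an enclosing definition. `reference_full_name`, its `share_namespace` flag and the `referenceName` attribute are hypothetical and only illustrate the choice being asked about.

```python
def full_name(name, namespace=None):
    # Existing rule (simplified): a dotted name is already a full name;
    # otherwise the namespace, if any, is prepended.
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def reference_full_name(schema, share_namespace):
    # Without the hypothetical attribute, the reference is the usual full name.
    if "referenceName" not in schema:
        return full_name(schema["name"], schema.get("namespace"))
    ref = schema["referenceName"]
    if share_namespace:
        # Option 1: reference names are qualified by "namespace" like "name" is.
        return full_name(ref, schema.get("namespace"))
    # Option 2: reference names ignore the "namespace" field entirely.
    return ref


schema = {
    "type": "record",
    "name": "User",
    "namespace": "com.example",
    "referenceName": "UserV2",
    "fields": [],
}
shared = reference_full_name(schema, share_namespace=True)
ignored = reference_full_name(schema, share_namespace=False)
```

Here `shared` is `com.example.UserV2` and `ignored` is `UserV2`. The second option leaves reference names in a flat space unless they get a namespace field of their own.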
>
> 3. If avro parsers are changed to use {{referenceName}} instead of
> {{name}}, when present, how much concern is there about old parsers not being
> able to parse the new {{referenceName}} field if they are used sparingly
> (only generated when necessary)?