[ 
https://issues.apache.org/jira/browse/AVRO-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. updated AVRO-2159:
--------------------------------------
    Component/s: spec

> Naming Limitations of Schemas in Stricter Reference Contexts
> ------------------------------------------------------------
>
>                 Key: AVRO-2159
>                 URL: https://issues.apache.org/jira/browse/AVRO-2159
>             Project: Apache Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Bridger Howell
>            Priority: Major
>
> (Excuse the lengthiness of this ticket description - it was initially written 
> as an email that became too long. Feel free to correct any misguided 
> reasoning.)
> I've come to realize that there are some undesirable constraints on how avro 
> schemas can be used in Java code generation and IDL, that only appear as 
> minor annoyances when you use schemas generically. In particular, I'm focused 
> on cases where it's desirable to use two schemas that have the same name in 
> some context.
>   
>  *Issue:*
>  Suppose I'm writing an application that publishes many different kinds of 
> data somewhere, with each type of data having its own schema. And then 
> suppose that some number of those schemas would like to share some kind of 
> common schema, to start with.
>   
>  If I do this, and I happen to be using Java code generation to manage 
> schemas, I'll soon find difficulty in two directions:
>   
>  - I would find it difficult to upgrade the data shared among all of these 
> external schemas by way of the common schema, without upgrading all of those 
> schemas at the same time. The problem is that neither Java's classpath nor an 
> IDL protocol can hold two schemas with the same name at once, because avro's 
> name field maps to a class name on the classpath or to a reference name among 
> a protocol's symbols.
>   
>  The intermediate step of the application being partially migrated between 
> version 1 and version 2 of a common schema has no representation in either of 
> these contexts. Using a different name becomes a very annoying option in many 
> cases, since it is an incompatible change (or with aliases, it's at least not 
> consistently compatible across implementations).
>  - I would find it difficult to migrate away from the external schemas using 
> that shared schema, for the same reasons listed above.
> In IDL (without code generation), these issues can usually be avoided by 
> creating a second protocol, and in generic avro, the issues would be avoided 
> by using a different schema parser or schema builder.
>   
>  *Analysis:*
>  At first glance, it is tempting to blame the name-matching requirement for 
> schema resolution as a culprit - and it may be correct in many cases that 
> requiring schemas to have compatible structure is all that is needed.
>   
>  However, the way I see it is that the name-matching requirement for schema 
> resolution is there to ensure that there is _the intent for two schemas to 
> resolve with each other_, and the rest of the checks are just there to make 
> sure that such an intent can be reasonably carried out.
>   
>  The difficulty in either of the two examples above happens not because of a 
> lack of pre-determined intent for schemas to resolve, but rather because of 
> the inability to simultaneously supply a unique reference for each of the 
> schemas, while intending that the correct groups of schemas can resolve.
>   
>  Thus, the way to avoid these issues so far has been to create a new 
> reference context, and the severity of the issue in each case corresponds to 
> the difficulty of creating a new reference context:
>  * For generic schemas, create a new parser or schema builder [easy - minorly 
> annoying]
>  * For IDL, create a new protocol [minorly annoying - somewhat annoying]
>  * For Java code generation, create a new classpath [very annoying (Java 9) - 
> impossible]
> Based on that, I understand a schema's name as expressing two overlapping 
> meanings:
>  - the intent to be able to resolve with other schemas with the same name 
> (let's call this the {{resolveName}})
>  - the ability to be uniquely referenced from some context (let's call this 
> the {{referenceName}})
>  
>  If these two meanings were able to be specified independently, I think that 
> schemas would be much easier to use in contexts where references are more 
> limited.
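
To make the two meanings concrete, here is a minimal Python sketch that models a reference context (a parser, an IDL protocol, or a classpath) as a dictionary of schemas. Everything in it is hypothetical: `ReferenceContext`, `same_resolve_intent`, and the `referenceName` attribute only illustrate the proposal and are not part of Avro or of any Avro library.

```python
class ReferenceContext:
    """Toy stand-in for a parser, an IDL protocol, or a classpath."""

    def __init__(self, key_field):
        # Schema attribute that must be unique within this context.
        self.key_field = key_field
        self.schemas = {}

    def define(self, schema):
        key = schema[self.key_field]
        if key in self.schemas:
            raise ValueError(f"can't redefine: {key}")
        self.schemas[key] = schema


def same_resolve_intent(writer, reader):
    # Only the name-matching check; structural checks are omitted here.
    return writer["name"] == reader["name"]


# "referenceName" is the hypothetical new field from this ticket.
user_v1 = {"type": "record", "name": "User", "referenceName": "UserV1", "fields": []}
user_v2 = {"type": "record", "name": "User", "referenceName": "UserV2", "fields": []}

# Keyed by name (today's behaviour): the second version collides.
by_name = ReferenceContext("name")
by_name.define(user_v1)
try:
    by_name.define(user_v2)
    collided = False
except ValueError:
    collided = True

# Keyed by the hypothetical referenceName: both versions coexist,
# while the shared name still expresses the intent to resolve.
by_reference = ReferenceContext("referenceName")
by_reference.define(user_v1)
by_reference.define(user_v2)
```

In this sketch the partially migrated state (version 1 and version 2 of the common schema in one context) becomes representable without giving up the name match used for resolution.
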
>   
>  *Speculative Solutions:*
>  Minimally, I think it's reasonable to create at least one new field to 
> separate the meaning of a schema's {{referenceName}} from its 
> {{resolveName}}, and to fall back to the existing name field whenever the 
> new field is missing. Then other tools that don't immediately apply schema 
> resolution can optionally upgrade to support using the {{referenceName}} 
> instead of the {{resolveName}}.
>   
>  Beyond that, having {{name}} continue to mean {{resolveName}} would mean 
> that old avro implementations would be able to treat newer schemas as valid 
> and resolve against them correctly. So I think it's reasonable to say that 
> {{referenceName}} should be the new field introduced (not necessarily with 
> that name).
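
The fallback described above could look roughly like the following Python sketch. The helper names and the `referenceName` attribute are assumptions made for illustration; the only existing Avro behaviour relied on is that `name` is what schema resolution matches on.

```python
def resolve_name(schema):
    # Unchanged meaning: the existing name field drives schema resolution.
    return schema["name"]


def reference_name(schema):
    # Hypothetical new field; fall back to name when it is absent,
    # so schemas written before the change behave exactly as before.
    return schema.get("referenceName", schema["name"])


old_style = {"type": "record", "name": "User", "fields": []}
new_style = {"type": "record", "name": "User", "referenceName": "UserV2", "fields": []}

# Both schemas still share the same resolve name, but a tool that
# understands referenceName can tell them apart.
names = [(resolve_name(s), reference_name(s)) for s in (old_style, new_style)]
```

A tool that does not know about the new field would simply keep using `name` for both purposes, which is the compatibility property argued for above.
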
>   
>  Assuming I've made no mistakes up until this point, there are some remaining 
> questions:
>   
>  1. How should this appear in IDL? There are two solutions that come to mind, 
> using the existing intuition of how annotations work:
>   
>    a. Declared as: {{@ref("UserV2") record User {}}}
>        Used as: {{@ref("UserV2") User user;}}
>   
>        - Can be quite verbose
>        - The annotated type ({{User}}) only really exists as a placeholder
>        - Requires a new error message for when a type is used but needs a 
> reference name.
>    b. Declared as: {{@name("User") record UserV2 {}}}
>        Used as: {{UserV2 user;}}
>   
>        - Relies on a weird intuition of how a "type name" maps to a raw 
> schema. Normally, the type name becomes the schema's name field, but when the 
> {{name}} field is specified by annotation, the type name just becomes the 
> {{referenceName}} field.
>        - Requires more work to change the implementation of type annotations, 
> since {{name}} is normally a reserved field.
>        - Matches the current intuition that two schemas resolve only if their 
> {{name}} fields match, and that type names are always used for references.
>   
>  2. How should the {{namespace}} field be handled?
> Should both names share the {{namespace}} field and act like the name field 
> does now (where any name that contains a dot is assumed to specify the full 
> name)? Or should reference names ignore the {{namespace}} field (and maybe 
> have one of their own)?
>   
>  3. If avro parsers are changed to use {{referenceName}} instead of 
> {{name}}, when present, how much concern is there about old parsers not being 
> able to parse the new {{referenceName}} field if it is used sparingly 
> (only generated when necessary)?
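
For question 2, the two namespace options can be compared with a small Python sketch. `full_name` follows the existing rule quoted above (a name containing a dot is taken as the full name); the other two functions and the `referenceName` attribute are hypothetical, and namespace inheritance from enclosing schemas is left out for brevity.

```python
def full_name(name, namespace=None):
    # Existing rule: a dotted name is already a full name.
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def reference_shared_namespace(schema):
    # Option A (assumed): referenceName shares the namespace field with name.
    ref = schema.get("referenceName", schema["name"])
    return full_name(ref, schema.get("namespace"))


def reference_ignoring_namespace(schema):
    # Option B (assumed): an explicit referenceName is used verbatim;
    # without one, fall back to the usual full name.
    if "referenceName" in schema:
        return schema["referenceName"]
    return full_name(schema["name"], schema.get("namespace"))


schema = {
    "type": "record",
    "name": "User",
    "namespace": "com.example",
    "referenceName": "UserV2",
    "fields": [],
}

shared = reference_shared_namespace(schema)      # "com.example.UserV2"
ignoring = reference_ignoring_namespace(schema)  # "UserV2"
```

Under option A the generated Java class would land in the schema's package; under option B the reference name would have to carry its own package, for example as a dotted name.
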



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
