[jira] [Created] (AVRO-2159) Naming Limitations of Schemas in Stricter Reference Contexts

Bridger Howell (JIRA) Wed, 14 Mar 2018 20:12:21 -0700

Bridger Howell created AVRO-2159:
------------------------------------

             Summary: Naming Limitations of Schemas in Stricter Reference 
Contexts
                 Key: AVRO-2159
                 URL: https://issues.apache.org/jira/browse/AVRO-2159
             Project: Avro
          Issue Type: New Feature
            Reporter: Bridger Howell



(Excuse the lengthiness of this ticket description - it was initially written 
as an email that became too long. Feel free to correct any misguided reasoning.)

I've come to realize that there are some undesirable constraints on how avro 
schemas can be used in Java code generation and IDL, constraints that only 
appear as minor annoyances when you use schemas generically. In particular, I'm 
focused on cases where it's desirable to use two schemas that have the same 
name in some context.
  
 *Issue:*
 Suppose I'm writing an application that publishes many different kinds of 
data somewhere, with each type of data having its own schema. And then suppose 
that some number of those schemas would like to share some kind of common 
schema, to start with.
  
 If I do this, and I happen to be using Java code generation to manage schemas, 
I'll soon find difficulty in two directions:
  
 - I would find it difficult to upgrade the data shared among all of these 
external schemas by way of the common schema, without upgrading all of those 
schemas at the same time. The problem is that avro's name field maps to a 
class name on Java's classpath, or to a reference name among an IDL protocol's 
symbols, and neither context can hold two schemas with the same name at once.
  
 The intermediate step of the application being partially migrated between 
version 1 and version 2 of a common schema has no representation in either of 
these contexts. Using a different name becomes a very annoying option in many 
cases, since it is an incompatible change (or with aliases, it's at least not 
consistently compatible across implementations).
 - I would find it difficult to migrate away from the external schemas using 
that shared schema, for the same reasons listed above.

In IDL (without code generation), these issues can usually be avoided by 
creating a second protocol, and in generic avro, the issues would be avoided by 
using a different schema parser or schema builder.
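
The idea of a "reference context" can be made concrete with a toy model. The 
sketch below is not Avro's API: {{ReferenceContext}}, {{define}} and {{lookup}} 
are hypothetical names, and the class only imitates the rule that one context 
(a parser, a protocol, a classpath) holds at most one definition per name.

```python
class ReferenceContext:
    """Toy stand-in for a schema parser, an IDL protocol, or a classpath."""

    def __init__(self):
        self._by_name = {}

    def define(self, schema):
        # A context refuses a second definition under an existing name.
        name = schema["name"]
        if name in self._by_name:
            raise ValueError(f"can't redefine: {name}")
        self._by_name[name] = schema
        return schema

    def lookup(self, name):
        return self._by_name[name]


user_v1 = {"type": "record", "name": "User",
           "fields": [{"name": "id", "type": "long"}]}
user_v2 = {"type": "record", "name": "User",
           "fields": [{"name": "id", "type": "long"},
                      {"name": "email", "type": "string", "default": ""}]}


def try_define_both(context):
    """Return True if one context accepts both versions, else False."""
    try:
        context.define(user_v1)
        context.define(user_v2)
        return True
    except ValueError:
        return False


# One context cannot hold both versions of "User" ...
single_context_ok = try_define_both(ReferenceContext())

# ... but two independent contexts can each hold one of them.
ctx_a, ctx_b = ReferenceContext(), ReferenceContext()
ctx_a.define(user_v1)
ctx_b.define(user_v2)
```

Creating a second context is cheap here, which mirrors the generic case; the 
same step for a classpath is where the cost becomes prohibitive.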
  
 *Analysis:*
 At first glance, it is tempting to blame the name-matching requirement for 
schema resolution as a culprit - and it may be correct in many cases that 
requiring schemas have compatible structure is all that is needed.
  
 However, the way I see it is that the name-matching requirement for schema 
resolution is there to ensure that there is _the intent for two schemas to 
resolve with each other_, and the rest of the checks are just there to make 
sure that such an intent can be reasonably carried out.
  
 The difficulty in either of the two examples above comes not from a 
lack of pre-determined intent for schemas to resolve, but rather from the 
inability to simultaneously supply a unique reference for each of the schemas, 
while intending that the correct groups of schemas can resolve.
  
 Thus, the way to avoid these issues so far has been to create a new reference 
context, and the severity of the issue in each case corresponds to the 
difficulty of creating a new reference context:
 * For generic schemas, create a new parser or schema builder [easy - minorly 
annoying]
 * For IDL, create a new protocol [minorly annoying - somewhat annoying]
 * For Java code generation, create a new classpath [very annoying (Java 9) - 
impossible]

Based on that, I understand a schema's name as expressing two overlapping 
meanings:
 - the intent to be able to resolve with other schemas with the same name 
(let's call this the {{resolveName}})
 - the ability to be uniquely referenced from some context (let's call this the 
{{referenceName}})

 
 If these two meanings were able to be specified independently, I think that 
schemas would be much easier to use in contexts where references are more 
limited.
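
As a rough illustration of what specifying the two meanings independently 
could look like, here is a minimal sketch. It assumes nothing about Avro's 
implementation: {{NamedSchema}}, {{intends_to_resolve}} and {{build_context}} 
are hypothetical names used only to show two versions of one schema sharing a 
single context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedSchema:
    resolve_name: str      # intent: which schemas this one may resolve with
    reference_name: str    # identity: unique key inside one reference context
    fields: tuple = ()


def intends_to_resolve(writer, reader):
    """Name check only; structural compatibility checks would follow."""
    return writer.resolve_name == reader.resolve_name


def build_context(schemas):
    """Index schemas by reference name, rejecting duplicates."""
    context = {}
    for schema in schemas:
        if schema.reference_name in context:
            raise ValueError(f"duplicate reference: {schema.reference_name}")
        context[schema.reference_name] = schema
    return context


# Both versions resolve as "User" but are referenced under distinct names.
user_v1 = NamedSchema("User", "UserV1", ("id",))
user_v2 = NamedSchema("User", "UserV2", ("id", "email"))

context = build_context([user_v1, user_v2])
```

With this split, the partially migrated state (some schemas pointing at 
{{UserV1}}, others at {{UserV2}}) has a representation in a single context.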
  
 *Speculative Solutions:*
 Minimally, I think it's reasonable to create at least one new field to 
separate the meaning of a schema's {{referenceName}} from its {{resolveName}}, 
and use the old name field to compatibly handle missing values. Then other 
tools that don't immediately apply schema resolution can optionally upgrade to 
support using the {{referenceName}} instead of the {{resolveName}}.
  
 Beyond that, having {{name}} continue to mean {{resolveName}} would mean that 
old avro implementations would be able to treat newer schemas as valid and 
resolve against them correctly. So I think it's reasonable to say that 
{{referenceName}} should be the new field introduced (not necessarily with that 
name).
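
The fallback behaviour described above might be sketched as follows. This is 
an assumption-laden illustration, not an existing Avro feature: the 
{{referenceName}} field is the hypothetical one proposed in this ticket, and 
the two helper functions are made up for the example. Schemas are plain 
dictionaries standing in for parsed JSON.

```python
def resolve_name(schema):
    """The existing `name` field keeps its resolution meaning."""
    return schema["name"]


def reference_name(schema):
    """Use the hypothetical `referenceName` field, falling back to `name`."""
    return schema.get("referenceName", schema["name"])


# A schema written before the proposal: one name serves both purposes.
old_style = {"type": "record", "name": "User", "fields": []}

# A schema using the proposed field: referenced as UserV2, resolved as User.
new_style = {"type": "record", "name": "User",
             "referenceName": "UserV2", "fields": []}
```

A tool that only ever reads {{name}} sees both schemas as {{User}}, which is 
the compatibility property argued for above.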
  
 Assuming I've made no mistakes up until this point, there are some remaining 
questions:
  
 1. How should this appear in IDL? There are two solutions that come to mind, 
using the existing intuition of how annotations work:
  
   a. Declared as: {{@ref("UserV2") record User {}}}
       Used as: {{@ref("UserV2") User user;}}
  
       - Can be quite verbose
       - The annotated type ({{User}}) only really exists as a placeholder
       - Requires a new error message for when a type is used but needs a 
reference name.

  b. Declared as: {{@name("User") record UserV2 {}}}
       Used as: {{UserV2 user;}}
  
       - Relies on a weird intuition of how a "type name" maps to a raw schema. 
Normally, the type name becomes the schema's name field, but when the {{name}} 
field is specified by annotation, the type name just becomes the 
{{referenceName}} field.
       - Requires more work to change the implementation of type annotations, 
since {{name}} is normally a reserved field.
       - Matches the current intuition that two schemas resolve only if their 
{{name}} fields match, and that type names are always used for references.
  
 2. How should the {{namespace}} field be handled?

Should both names share the {{namespace}} field and act like the name field 
does now (where any name that contains a dot is assumed to specify the full 
name)? Or should reference names ignore the {{namespace}} field (and maybe have 
one of their own)?
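
To make the two options concrete, here is a small sketch. {{full_name}} 
follows the existing rule quoted above (a dotted name is taken as the full 
name, otherwise the namespace is prepended); the two {{reference_full_name_*}} 
functions and the {{referenceName}} field are hypothetical and only show how 
each option would behave.

```python
def full_name(name, namespace=None):
    """Existing rule: a name containing a dot is already a full name."""
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def reference_full_name_shared(schema):
    """Option 1: the reference name shares the schema's `namespace`."""
    ref = schema.get("referenceName", schema["name"])
    return full_name(ref, schema.get("namespace"))


def reference_full_name_standalone(schema):
    """Option 2: an explicit reference name ignores `namespace`."""
    if "referenceName" in schema:
        return schema["referenceName"]
    # Without the new field, behave exactly as `name` does today.
    return full_name(schema["name"], schema.get("namespace"))


schema = {"type": "record", "name": "User", "namespace": "com.example",
          "referenceName": "UserV2", "fields": []}

shared = reference_full_name_shared(schema)
standalone = reference_full_name_standalone(schema)
```

Under the first option the generated class would stay in the schema's package; 
under the second, the reference name alone decides where it lands.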
  
 3. If avro parsers are changed to use {{referenceName}} instead of 
{{name}}, when present, how much concern is there about old parsers not being 
able to parse the new {{referenceName}} field if it is used sparingly (only 
generated when necessary)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
