Re: Avro union compatibility mode enhancement proposal

Matthieu Monsch Thu, 16 Jun 2016 09:02:33 -0700

Thanks for the feedback Doug!

I think the change would still fit under the "formerly known as" umbrella. More 
specifically, the current implementation implements a "formerly read as" and 
the proposed suggestion would also allow the "formerly written as" counterpart.


Perhaps a new example would help make this explicit (I picked the first example 
to show that we could use this to solve the thread's initial union evolution 
issue but that also made the underlying logic less explicit). Assume we have 
the following records:

// Old schema.
record Event {
  long id;
}

// New schema.
@aliases(["Event"])
record DetailedEvent {
  long id;
  string detail = "";
}

Schema evolution between readers and writers can happen in two ways:

- Readers switch first to the new DetailedEvent. This is currently supported by 
  the alias (the reader is aware that the DetailedEvent was "formerly read as" 
  Event).
- Writers switch first, and readers now become unable to read the new data. There 
  is no way currently for the writer to communicate the "formerly written as" 
  relationship.

Phrasing it differently, aliases are currently unidirectional and this change 
makes them bidirectional. I feel the added symmetry makes them slightly easier 
to understand as well: no need to remember which of the reader's or writer's 
schema's aliases are taken into account (with the caveat that if both are 
defined, some priority must be enforced).
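
To make the proposed lookup order concrete, here is a minimal sketch in Python. It is only an illustration, not the Java implementation: `Record` and `resolve_branch` are hypothetical names, and the sketch ignores namespaces and field-level resolution. It assumes the priority suggested earlier in the thread (names first, then reader aliases, then writer aliases).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a named Avro schema (not the Avro API)."""
    name: str
    aliases: tuple = ()


def resolve_branch(writer, reader_branches):
    """Return the reader branch matching the writer record, or None."""
    # 1. Exact name match.
    for reader in reader_branches:
        if reader.name == writer.name:
            return reader
    # 2. Reader aliases: the reader's record was "formerly read as" writer.name.
    for reader in reader_branches:
        if writer.name in reader.aliases:
            return reader
    # 3. Proposed addition: writer aliases ("formerly written as"), tried last.
    for alias in writer.aliases:
        for reader in reader_branches:
            if reader.name == alias:
                return reader
    return None


old_event = Record("Event")
new_event = Record("DetailedEvent", aliases=("Event",))

# Readers switch first: resolves today through the reader's alias (step 2).
print(resolve_branch(old_event, [new_event]).name)
# Writers switch first: resolves only with the proposed step 3.
print(resolve_branch(new_event, [old_event]).name)
```

Because writer aliases are only consulted after the first two steps fail, any pair of schemas that resolves today would resolve the same way in this sketch.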

What do you think?

I'm not very familiar with the Java implementation of schema resolution but 
from looking at the code it looks like it should be straightforward to make 
this work. For example we could allow the branch name to be passed internally 
to avoid duplicates. Or maybe something even simpler since it doesn't seem like 
such "resolving schemas" (created by the applyAliases method) can ever be used 
to write (would it even make sense to?) since they are created by 
ResolvingDecoders which don't expose them afterwards. Unless I missed something?

-Matthieu



> On Jun 14, 2016, at 10:40 AM, Doug Cutting <[email protected]> wrote:
> 
> Matthieu,
> 
> Thanks for the example.
> 
> First, is this really an alias, or is it something else?  In other
> words, would a reader ever map a written Vehicle to a Bus?  If the use
> cases are exclusive, perhaps we should call it something different
> rather than overload the alias concept?
> 
> Second, would the alias implementation, rewriting the writer's schema,
> work here?  It would result in a union with two, different, Vehicle
> records.  That could probably be made to work, but any other
> references to the Vehicle schema might become ambiguous.  I suspect
> the implementation may end up being quite different.
> 
> Aliases currently mean, "formerly known as", this feature seems more
> like, "a kind of".
> 
> Doug
> 
> On Sat, Jun 11, 2016 at 7:43 PM, Matthieu Monsch <[email protected]> wrote:
>> Happy to provide an example. Let’s assume that we have a Kafka producer 
>> emitting the following values:
>> 
>> union {
>>  record Vehicle {
>>    int id;
>>  },
>>  record Car {
>>    int id;
>>    boolean selfDriving;
>>  }
>> }
>> 
>> At a later point in time, a new vehicle becomes supported by the system and 
>> must be added to the schema:
>> 
>> union {
>>  record Vehicle {
>>    long id;
>>  },
>>  record Car {
>>    long id;
>>    boolean selfDriving;
>>  },
>>  @aliases(["Vehicle"]) // Ignored when on the producer's schema.
>>  record Bus {
>>    long id;
>>    int capacity;
>>  }
>> }
>> 
>> We would like to be able to deploy the change to the producer without having 
>> to migrate all the consumers: existing consumers would treat each Bus as a 
>> Vehicle until they upgrade.
>> 
>> However we can't do so under the current evolution rules since the alias is 
>> ignored (it would work if we added the alias to each consumer's schema but 
>> this isn't practical since it would also require a global migration). Note 
>> also that we can't preemptively add aliases on the consumers since the names 
>> of the records aren't known beforehand.
>> 
>> Allowing the consumers (readers) to use the producer's (writer’s) aliases 
>> would fix this. If we make sure that writer aliases are used last (for 
>> example only falling back to them if neither the names nor the consumers' 
>> aliases match), this doesn't change any of the current allowed evolution 
>> rules and expands them to support additional cases (without introducing any 
>> new syntax).
>> 
>> Does this make sense?
>> 
>> -Matthieu
>> 
>> PS: In case it’s more readable, this example can also be read here: 
>> https://gist.github.com/mtth/527318445e5b52bfd491c0483ff5f9d3
>> 
>> 
>> 
>>> On Jun 10, 2016, at 2:00 PM, Doug Cutting <[email protected]> wrote:
>>> 
>>> Matthieu,
>>> 
>>> Can you please provide an example of how this would work?
>>> 
>>> Thanks,
>>> 
>>> Doug
>>> 
>>> On Thu, Jun 9, 2016 at 6:47 PM, Matthieu Monsch <[email protected]> wrote:
>>> 
>>>> Thinking about this a bit more (and a couple months later…), maybe there
>>>> is a simpler alternative.
>>>> 
>>>> Currently, a reason why writer evolution is hard (the union issue
>>>> described below is a special case of this) is that aliases are only used on
>>>> the reader side. Why not also allow readers to use the writer’s aliases?
>>>> 
>>>> Resolution would first be done on names, then fall back to reader aliases,
>>>> and finally fall back to writer aliases. In the example below, it would be
>>>> enough to add an alias to the base record inside any new records to have
>>>> evolution work.
>>>> 
>>>> -Matthieu
>>>> 
>>>> 
>>>> 
>>>>> On Apr 22, 2016, at 8:42 AM, Matthieu Monsch <[email protected]> wrote:
>>>>> 
>>>>> The second solution sounds like a great alternative.
>>>>> 
>>>>> Branch aliases are more straightforward than an implicit order-sensitive
>>>>> policy. They also have the additional benefit of giving users a bit more
>>>>> flexibility: since defaults are specified on the branches’ types, it is
>>>>> possible to have different branches have different defaults inside the same
>>>>> union. There are probably a few edge cases (e.g. allowing multiple such
>>>>> aliases would be useful) but they should be simple to address.
>>>>> 
>>>>> What would be a good attribute name for this? `baseTypes`?
>>>>> 
>>>>> -Matthieu
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Apr 21, 2016, at 10:52 AM, Doug Cutting <[email protected]> wrote:
>>>>>> 
>>>>>> On Wed, Apr 20, 2016 at 9:09 PM, Ryan Blue <[email protected]> wrote:
>>>>>>> Making the default a property of an
>>>>>>> inner schema makes me think that we will have to deal with multiple schemas
>>>>>>> with such a label at some point.
>>>>>> 
>>>>>> On Thu, Apr 21, 2016 at 6:54 AM, Matthieu Monsch <[email protected]> wrote:
>>>>>>> Delegating default selection to the branches themselves is a great idea but it
>>>>>>> will be tricky to handle reference branches smoothly. More minor but it also
>>>>>>> doesn’t feel intuitive to not have the union “own” its default attribute.
>>>>>> 
>>>>>> If I understand your concerns correctly, I attempted to address this above:
>>>>>> 
>>>>>> "Note however that, when using a record as the default branch, one
>>>>>> could not then
>>>>>> use that same record as a non-default branch in another union.  To
>>>>>> ameliorate that, we might permit multiple default branches in a union
>>>>>> to be specified as default with the convention that the first such is
>>>>>> used."
>>>>>> 
>>>>>> Does that make sense?
>>>>>> 
>>>>>> This isn't ideal syntax, but it's not terrible, and it doesn't change
>>>>>> schema syntax incompatibly, which seems important, especially when it's
>>>>>> unlikely that all implementations would implement such a syntax change
>>>>>> in a synchronized manner.
>>>>>> 
>>>>>> Alternately, one might annotate each derived record with the name of
>>>>>> its base record, then one wouldn't need to alter union definitions.
>>>>>> This would work like an alias.  If a record doesn't exist in the
>>>>>> reader's schema, then an alias to the missing record would be added in
>>>>>> the reader's schema to the base record it names in the writer's
>>>>>> schema.  Aliases work by rewriting the writer's schema at read-time,
>>>>>> updating names, including those in unions.  Might that work?  It seems
>>>>>> like perhaps a more elegant approach.  It has compatible syntax and
>>>>>> only alters behavior of a case that fails today.
>>>>>> 
>>>>>> Doug
