Re: Avro union compatibility mode enhancement proposal

Matthieu Monsch Sat, 16 Apr 2016 13:24:19 -0700

> I think that substituting different data for
> unrecognized branches in a union isn't the way to fix the problem, but I
> have a proposal for a way to fix it that looks more like you'd expect in
> the OO example above: by adding the Vehicle class to your read schema.
> 
> Right now, unions are first resolved by full class name as required by the
> spec. But after that, we have some additional rules to match schemas. These
> rules are how we reconcile name differences from situations like writing
> with a generic class and reading with a specific class. I'm proposing you
> use a catch-all class (the superclass) with fields that are in all of the
> union's branches, and we update schema resolution to allow it.



That sounds good. The only thing I’d add is to make this catch-all behavior 
explicit in the schema (similar to how field defaults must be explicitly added).
To help fix another common writer evolution issue, we could also add a similar 
catch-all for `enum`s (optional, to be explicitly specified in the schema).
-Matthieu



> On Apr 12, 2016, at 2:41 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Yacine,
> 
> Thanks for the extra information. Sorry for my delay in replying, I wanted
> to time think about your suggestion.
> 
> I see what you mean that you can think of a union as the superclass of its
> options. The reflect object model has test code that does just that, where
> classes B and C inherit from A and the schema for A is created as a union
> of B and C. But, I don't think that your suggestion aligns with the
> expectations of object oriented design. Maybe that's an easier way to
> present my concern:
> 
> Say I have a class, Vehicle, with subclasses Car and Truck. I have
> applications that work with my dataset, the vehicles that my company owns,
> and we buy a bus. If I add a Bus class, what normally happens is that an
> older application can work with it. A maintenance tracker would call
> getVehicles and can still get the lastMaintenanceDate for my Bus, even
> though it doesn't know about busses. But what you suggest is that it is
> replaced with a default, say null, in cases like this.
> 
> I think that the problem is that Avro no equivalent concept of inheritance.
> There's only one way to represent it for what you need right now, like
> Matthieu suggested. I think that substituting different data for
> unrecognized branches in a union isn't the way to fix the problem, but I
> have a proposal for a way to fix it that looks more like you'd expect in
> the OO example above: by adding the Vehicle class to your read schema.
> 
> Right now, unions are first resolved by full class name as required by the
> spec. But after that, we have some additional rules to match schemas. These
> rules are how we reconcile name differences from situations like writing
> with a generic class and reading with a specific class. I'm proposing you
> use a catch-all class (the superclass) with fields that are in all of the
> union's branches, and we update schema resolution to allow it. So you'd
> have...
> 
> record Vehicle {
>  long lastMaintenanceDate;
> }
> 
> record Car {
>  long lastMaintenanceDate;
>  int maxChildSeats;
> }
> 
> record Truck {
>  long lastMainetanceDate;
>  long lastWashedDate;
> }
> 
> Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The
> catch-all class could provide the behavior you want, without silently
> giving you incorrect data.
> 
> Does that sound like a reasonable solution for your use case?
> 
> rb
> 
> On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <mon...@alum.mit.edu>
> wrote:
> 
>> Hi Yacine,
>> 
>> I believe Ryan was mentioning that if you start from a schema `[“null”,
>> “Car”]` then rather than add a new bus branch to the union, you could
>> update the car’s schema to include new bus fields. For example (using IDL
>> notation):
>> 
>>> // Initial approach
>>> // Evolved to union { null, Car, Bus }
>>> record Car {
>>>  string vin;
>>> }
>>> record Bus {
>>>  string vin;
>>>  int capacity;
>>> }
>>> 
>>> // Alternative approach
>>> // Here evolution wouldn't change the union { null, Vehicle }
>>> // Note also that this is actually a compatible evolution from the first
>> approach, even under current rules (using aliases for name changes).
>>> record Vehicle {
>>>  string vin; // (Would be optional if not all branches had it.)
>>>  union { null, int } capacity; // The new field is added here.
>>> }
>> 
>> Pushing this further, you could also directly start upfront with just a
>> “Vehicle” record schema with an optional `car` field and add new optional
>> fields as necessary (for example a `bus` field). This gives you much
>> greater flexibility for evolving readers and writers separately (but you
>> lose the ability to enforce that at most one field is ever present in the
>> record).
>> 
>> I agree that Avro’s current evolution rules are more geared towards reader
>> evolution: evolving writers while keeping readers constant is difficult
>> (namely because they don't support adding branches to unions or symbols to
>> enums). However, adding a completely new and separate "compatibility mode”
>> has a high complexity cost; both for implementors (the separate classes you
>> mention) and users (who must invoke them specifically). It would be best to
>> keep evolution rules universal.
>> 
>> Maybe we could extend the logic behind field defaults? Enum schemas could
>> simply contain a default symbol attribute. For unions, it’s a bit trickier,
>> but we should be able to get around with a default branch attribute which
>> would allow evolution if and only if the newly added branches can be read
>> as the default branch. Assuming the attribute names don’t already exist,
>> this would be compatible with the current behavior.
>> 
>> I think this use-case is common enough that it’d be worth exploring (for
>> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address this
>> issue; though not completely unfortunately [2]).
>> 
>> -Matthieu
>> 
>> [1] https://github.com/linkedin/rest.li
>> [2]
>> https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible
>> 
>> 
>> 
>>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane <
>> ybenabderrahm...@octo.com> wrote:
>>> 
>>> Hi Ryan,
>>> 
>>> Just a little up^^. Could you please (or anyone else) give me a little
>>> feedback ?
>>> Thanks in advance.
>>> 
>>> Regards,
>>> Yacine.
>>> 
>>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane <
>> ybenabderrahm...@octo.com
>>>> :
>>> 
>>>> Hi Ryan,
>>>> 
>>>> Thank you for giving feedback. I will try in the following to provide
>> you
>>>> some more details about the addressed problem.
>>>> 
>>>> But before that, just a brief reminder of the context. Avro has been
>>>> chosen in this project (and by many other ones for sure) especially for
>> a
>>>> very important feature: enabling forward and backward compatibility
>>>> management through schema life-cycle. Our development model involves
>>>> intensive usage of this feature, and many heavy developments are made in
>>>> parallel streams inside feature teams that share the same schema,
>> provided
>>>> the evolution of the latter complies with the stated compatibility
>> rules.
>>>> This implies that all the entities supported by the Avro schema must
>>>> support the two-way compatibility, including unions. However, in the
>>>> special case of the union, this two-way compatibility is not well
>> supported
>>>> by the current rules. Let me explain you the basement of our point of
>> view,
>>>> it remains quite simple.
>>>> 
>>>> The use case is to have, for example,
>>>> - a first union version A:
>>>>     { "name": "Vehicle",
>>>>       "type": ["null", "Car"] }
>>>> - a new version of it B:
>>>>     { "name": "Vehicle",
>>>>       "type": ["null", "Car", "Bus"] }
>>>> For being forward compatible, an evolution of the union schema must
>>>> guarantee that an old reader reading with A can read the data written
>> with
>>>> the new schema B. Getting an error just means that the forward
>>>> compatibility feature is broken. But this is not actually the case (and
>>>> this behavior is not suitable), because the old reader has a correct
>> schema
>>>> and this schema has evolved naturally to version B to incorporate a new
>>>> Vehicle type. Not knowing this new type must not produce an error, but
>> just
>>>> give the reader a default value, which means: "Either the data is not
>> there
>>>> or you do not know how to handle it".
>>>> 
>>>> This is thought while keeping in mind that in an object-oriented code
>>>> modeling, a union field is seen as a class member with the higher level
>>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore, it is
>>>> natural for a modeler / programmer to implement the ability of not
>> getting
>>>> the awaited types and using some default value of known type. To give a
>>>> more complete specification, the new mode of compatibility has to impose
>>>> one rule: the union default value must not change through versions and
>> the
>>>> corresponding type must be placed at the top of the types list. This is
>>>> much easier to handle by development streams, because it is addressed
>> once
>>>> for all in the very beginning of the schema life-cycle, than the fact to
>>>> oblige a number of teams, among which some are just not in place
>> anymore,
>>>> to update the whole code just because another dev team has deployed a
>> new
>>>> version of the union in the schema.
>>>> 
>>>> Now, for being backward compatible, the reader with B must always be
>> able
>>>> to read data written with schema A. Even if the type included in the
>> data
>>>> is not known, so it gets the default value and not an error.
>>>> 
>>>> I understand that getting an error could make sense when the requested
>>>> field is not present. However, this behavior:
>>>> 
>>>>  - is very restrictive, meaning: this obliges the old reader to update
>>>>  its code for integrating the new schema, while he is not managing to
>> do it
>>>>  for many reasons: development stream of next delivery is not
>> finished, or
>>>>  not engaged, or not even planned - in the case of old and stable code
>>>>  - breaks the forward compatibility feature: the older reader is not
>>>>  able to read the new version of the union without getting an error
>>>>  - breaks the backward compatibility feature: the new reader is not
>>>>  able to read an old version containing unknown types of the union
>> without
>>>>  getting an error
>>>> 
>>>> By the way, what do you exactly mean by "pushing evolution lower" and
>>>> "update the record"? Could you please give me an example of the trick
>> you
>>>> are talking about?
>>>> 
>>>> Just to be a bit more precise, we are not targeting to use a "trick".
>> This
>>>> life-cycle management should be included in a standard so to keep the
>>>> software development clean, production safe and compliant with a complex
>>>> product road-map.
>>>> 
>>>> Finally, you seem to be concerned by the "significant change to the
>>>> current evolution rules". Well, we actually do not change these rules,
>> they
>>>> keep just the same. All we are proposing is to introduce a *mode* where
>>>> the rules of union compatibility change. This mode is materialized by a
>>>> minimum and thin impact of the existing classes without any change in
>> the
>>>> behavior, all the logic of the new compatibility mode is implemented by
>> new
>>>> classes that must be invoked specifically. But you would better see it
>> in
>>>> the code patch.
>>>> 
>>>> Looking forward to reading your feedback and answers.
>>>> 
>>>> Regards,
>>>> Yacine.
>>>> 
>>>> 
>>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <rb...@netflix.com.invalid>:
>>>> 
>>>>> Hi Yacine,
>>>>> 
>>>>> Thanks for the proposal. It sounds interesting, but I want to make sure
>>>>> there's a clear use case for this because it's a significant change to
>> the
>>>>> current evolution rules. Right now we guarantee that a reader will get
>> an
>>>>> error if the data has an unknown union branch rather than getting a
>>>>> default
>>>>> value. I think that makes sense: if the reader is requesting a field,
>> it
>>>>> should get the actual datum for it rather than a default because it
>>>>> doesn't
>>>>> know how to handle it.
>>>>> 
>>>>> Could you give us an example use case that requires this new logic?
>>>>> 
>>>>> I just want to make sure we can't solve your problem another way. For
>>>>> example, pushing evolution lower in the schema usually does the trick:
>>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1",
>>>>> "RecordV2"], it is usually better to update the record so that older
>>>>> readers can ignore the new fields.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> rb
>>>>> 
>>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane <
>>>>> ybenabderrahm...@octo.com> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> In order to provide a solution to the union schema evolution problem,
>>>>> as it
>>>>>> was earlier notified in the thread "add a type to a union
>>>>>> <
>>>>>> 
>>>>> 
>> http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union
>>>>>>> "
>>>>>> of the user mailing list, we decided, for the needs of the reactive
>>>>>> architecture we have implemented for one of our clients, to implement
>> an
>>>>>> evolution of the compatibility principle of Avro when using Unions.
>> For
>>>>>> reminder, the asked question was about the way to handle the case
>> where
>>>>> a
>>>>>> reader, using an old version of a schema that includes a union, reads
>>>>> some
>>>>>> data written with a new version of the schema where a type has been
>>>>> added
>>>>>> to the union.
>>>>>> 
>>>>>> As answered by Martin Kleppman in that thread, one way to handle this
>>>>> kind
>>>>>> of evolution (a new version of the schema adds a new type type in a
>>>>> union)
>>>>>> would be to ensure that all the development streams have integrated
>> the
>>>>> new
>>>>>> schema B before deploying it in the IT schema referential.
>>>>>> However, in big structures involving strongly uncorrelated teams (in
>> the
>>>>>> product life-cycle point of view), this approach appears to be quite
>>>>>> impracticable, causing production stream congestion, blocking behavior
>>>>>> between teams, and a bunch of other
>>>>>> unwanted-counter-agile-/-reactive-phenomena...
>>>>>> 
>>>>>> Therefore, we had to implement a new *compatibility* *mode* for the
>>>>> unions,
>>>>>> while taking care to comply with the following rules:
>>>>>> 
>>>>>>  1. Clear rules of compatibility are stated and integrated for this
>>>>>>  compatibility mode
>>>>>>  2. The standard Avro behavior must be kept intact
>>>>>>  3. All the evolution implementation must be done without introducing
>>>>> any
>>>>>>  regression (all existing tests of the Avro stack must succeed)
>>>>>>  4. The code impact on Avro stack must be minimized
>>>>>> 
>>>>>> Just to give you a very brief overview (as I don't know if this is
>>>>> actually
>>>>>> the place for a full detailed description), the evolution addresses
>> the
>>>>>> typical problem where two development streams use the same schema but
>> in
>>>>>> different versions, in the case described shortly as follows:
>>>>>> 
>>>>>>  - The first development stream, called "DevA", uses the version A of
>>>>> a
>>>>>>  schema which integrates a union referencing two types, say "null"
>> and
>>>>>>  "string". The default value is set to null.
>>>>>>  - The second development team, called "DevB", uses the version B,
>>>>> which
>>>>>>  is an evolution of the version A, as it adds a reference to a new
>>>>> type
>>>>>> in
>>>>>>  the former union, say "long" (which makes it "null", string" and
>>>>> "long")
>>>>>>  - When the schema B is deployed on the schema referential (in our
>>>>> case,
>>>>>>  the IO Confluent Schema Registry) subsequently to the version A
>>>>>>     - The stream "DevA" must be able to read with schema A, even if
>>>>> the
>>>>>>     data has been written using the schema B with the type "long" in
>>>>>> the union.
>>>>>>     In the latter case, the read value is the union default value
>>>>>>     - The stream "DevB" must be able to read/write with schema B,
>>>>> even if
>>>>>>     it writes the data using the type "long" in the union
>>>>>> 
>>>>>> The evolution that we implemented for this mode includes some rules
>> that
>>>>>> are based on the principles stated in the Avro documentation. It is
>> even
>>>>>> more powerful than showed in the few lines above, as it enables the
>>>>> readers
>>>>>> to get the default value of the union if the schema used for reading
>>>>> does
>>>>>> not contain the type used by the writer in the union. This achieves a
>>>>> new
>>>>>> mode of forward / backward compatibility. This evolution is for now
>>>>> working
>>>>>> perfectly, and should be on production in the few coming weeks. We
>> have
>>>>>> also made an evolution of the IO Confluent Schema Registry stack to
>>>>> support
>>>>>> it, again in a transparent manner (we also intend to contribute to
>> this
>>>>>> stack in a second / parallel step).
>>>>>> 
>>>>>> In the objective of contributing to the Avro stack with this new
>>>>>> compatibility mode for unions, I have some questions about the
>>>>> procedure:
>>>>>> 
>>>>>>  1. How can I achieve the contribution proposal? Should I directly
>>>>>>  provide a patch in JIRA and dive into the details right there?
>>>>>>  2. The base version of this evolution is 1.7.7, is it eligible to
>>>>>>  contribution evaluation anyway?
>>>>>> 
>>>>>> Thanks in advance, looking forward to hearing from you and giving you
>>>>> more
>>>>>> details.
>>>>>> 
>>>>>> Kind Regards,
>>>>>> --
>>>>>> *Yacine Benabderrahmane*
>>>>>> Architect
>>>>>> *OCTO Technology*
>>>>>> <http://www.octo.com>
>>>>>> -----------------------------------------------
>>>>>> Tel : +33 6 10 88 25 98
>>>>>> 50 avenue des Champs Elysées
>>>>>> 75008 PARIS
>>>>>> www.octo.com
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>> 
>>>> 
>>>> 
>> 
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix

Re: Avro union compatibility mode enhancement proposal

Reply via email to