> I think that substituting different data for > unrecognized branches in a union isn't the way to fix the problem, but I > have a proposal for a way to fix it that looks more like you'd expect in > the OO example above: by adding the Vehicle class to your read schema. > > Right now, unions are first resolved by full class name as required by the > spec. But after that, we have some additional rules to match schemas. These > rules are how we reconcile name differences from situations like writing > with a generic class and reading with a specific class. I'm proposing you > use a catch-all class (the superclass) with fields that are in all of the > union's branches, and we update schema resolution to allow it.
That sounds good. The only thing I’d add is to make this catch-all behavior explicit in the schema (similar to how field defaults must be explicitly added). To help fix another common writer evolution issue, we could also add a similar catch-all for `enum`s (optional, to be explicitly specified in the schema). -Matthieu > On Apr 12, 2016, at 2:41 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote: > > Yacine, > > Thanks for the extra information. Sorry for my delay in replying, I wanted > to time think about your suggestion. > > I see what you mean that you can think of a union as the superclass of its > options. The reflect object model has test code that does just that, where > classes B and C inherit from A and the schema for A is created as a union > of B and C. But, I don't think that your suggestion aligns with the > expectations of object oriented design. Maybe that's an easier way to > present my concern: > > Say I have a class, Vehicle, with subclasses Car and Truck. I have > applications that work with my dataset, the vehicles that my company owns, > and we buy a bus. If I add a Bus class, what normally happens is that an > older application can work with it. A maintenance tracker would call > getVehicles and can still get the lastMaintenanceDate for my Bus, even > though it doesn't know about busses. But what you suggest is that it is > replaced with a default, say null, in cases like this. > > I think that the problem is that Avro no equivalent concept of inheritance. > There's only one way to represent it for what you need right now, like > Matthieu suggested. I think that substituting different data for > unrecognized branches in a union isn't the way to fix the problem, but I > have a proposal for a way to fix it that looks more like you'd expect in > the OO example above: by adding the Vehicle class to your read schema. > > Right now, unions are first resolved by full class name as required by the > spec. But after that, we have some additional rules to match schemas. These > rules are how we reconcile name differences from situations like writing > with a generic class and reading with a specific class. I'm proposing you > use a catch-all class (the superclass) with fields that are in all of the > union's branches, and we update schema resolution to allow it. So you'd > have... > > record Vehicle { > long lastMaintenanceDate; > } > > record Car { > long lastMaintenanceDate; > int maxChildSeats; > } > > record Truck { > long lastMainetanceDate; > long lastWashedDate; > } > > Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The > catch-all class could provide the behavior you want, without silently > giving you incorrect data. > > Does that sound like a reasonable solution for your use case? > > rb > > On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <mon...@alum.mit.edu> > wrote: > >> Hi Yacine, >> >> I believe Ryan was mentioning that if you start from a schema `[“null”, >> “Car”]` then rather than add a new bus branch to the union, you could >> update the car’s schema to include new bus fields. For example (using IDL >> notation): >> >>> // Initial approach >>> // Evolved to union { null, Car, Bus } >>> record Car { >>> string vin; >>> } >>> record Bus { >>> string vin; >>> int capacity; >>> } >>> >>> // Alternative approach >>> // Here evolution wouldn't change the union { null, Vehicle } >>> // Note also that this is actually a compatible evolution from the first >> approach, even under current rules (using aliases for name changes). >>> record Vehicle { >>> string vin; // (Would be optional if not all branches had it.) >>> union { null, int } capacity; // The new field is added here. >>> } >> >> Pushing this further, you could also directly start upfront with just a >> “Vehicle” record schema with an optional `car` field and add new optional >> fields as necessary (for example a `bus` field). This gives you much >> greater flexibility for evolving readers and writers separately (but you >> lose the ability to enforce that at most one field is ever present in the >> record). >> >> I agree that Avro’s current evolution rules are more geared towards reader >> evolution: evolving writers while keeping readers constant is difficult >> (namely because they don't support adding branches to unions or symbols to >> enums). However, adding a completely new and separate "compatibility mode” >> has a high complexity cost; both for implementors (the separate classes you >> mention) and users (who must invoke them specifically). It would be best to >> keep evolution rules universal. >> >> Maybe we could extend the logic behind field defaults? Enum schemas could >> simply contain a default symbol attribute. For unions, it’s a bit trickier, >> but we should be able to get around with a default branch attribute which >> would allow evolution if and only if the newly added branches can be read >> as the default branch. Assuming the attribute names don’t already exist, >> this would be compatible with the current behavior. >> >> I think this use-case is common enough that it’d be worth exploring (for >> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address this >> issue; though not completely unfortunately [2]). >> >> -Matthieu >> >> [1] https://github.com/linkedin/rest.li >> [2] >> https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible >> >> >> >>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane < >> ybenabderrahm...@octo.com> wrote: >>> >>> Hi Ryan, >>> >>> Just a little up^^. Could you please (or anyone else) give me a little >>> feedback ? >>> Thanks in advance. >>> >>> Regards, >>> Yacine. >>> >>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane < >> ybenabderrahm...@octo.com >>>> : >>> >>>> Hi Ryan, >>>> >>>> Thank you for giving feedback. I will try in the following to provide >> you >>>> some more details about the addressed problem. >>>> >>>> But before that, just a brief reminder of the context. Avro has been >>>> chosen in this project (and by many other ones for sure) especially for >> a >>>> very important feature: enabling forward and backward compatibility >>>> management through schema life-cycle. Our development model involves >>>> intensive usage of this feature, and many heavy developments are made in >>>> parallel streams inside feature teams that share the same schema, >> provided >>>> the evolution of the latter complies with the stated compatibility >> rules. >>>> This implies that all the entities supported by the Avro schema must >>>> support the two-way compatibility, including unions. However, in the >>>> special case of the union, this two-way compatibility is not well >> supported >>>> by the current rules. Let me explain you the basement of our point of >> view, >>>> it remains quite simple. >>>> >>>> The use case is to have, for example, >>>> - a first union version A: >>>> { "name": "Vehicle", >>>> "type": ["null", "Car"] } >>>> - a new version of it B: >>>> { "name": "Vehicle", >>>> "type": ["null", "Car", "Bus"] } >>>> For being forward compatible, an evolution of the union schema must >>>> guarantee that an old reader reading with A can read the data written >> with >>>> the new schema B. Getting an error just means that the forward >>>> compatibility feature is broken. But this is not actually the case (and >>>> this behavior is not suitable), because the old reader has a correct >> schema >>>> and this schema has evolved naturally to version B to incorporate a new >>>> Vehicle type. Not knowing this new type must not produce an error, but >> just >>>> give the reader a default value, which means: "Either the data is not >> there >>>> or you do not know how to handle it". >>>> >>>> This is thought while keeping in mind that in an object-oriented code >>>> modeling, a union field is seen as a class member with the higher level >>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore, it is >>>> natural for a modeler / programmer to implement the ability of not >> getting >>>> the awaited types and using some default value of known type. To give a >>>> more complete specification, the new mode of compatibility has to impose >>>> one rule: the union default value must not change through versions and >> the >>>> corresponding type must be placed at the top of the types list. This is >>>> much easier to handle by development streams, because it is addressed >> once >>>> for all in the very beginning of the schema life-cycle, than the fact to >>>> oblige a number of teams, among which some are just not in place >> anymore, >>>> to update the whole code just because another dev team has deployed a >> new >>>> version of the union in the schema. >>>> >>>> Now, for being backward compatible, the reader with B must always be >> able >>>> to read data written with schema A. Even if the type included in the >> data >>>> is not known, so it gets the default value and not an error. >>>> >>>> I understand that getting an error could make sense when the requested >>>> field is not present. However, this behavior: >>>> >>>> - is very restrictive, meaning: this obliges the old reader to update >>>> its code for integrating the new schema, while he is not managing to >> do it >>>> for many reasons: development stream of next delivery is not >> finished, or >>>> not engaged, or not even planned - in the case of old and stable code >>>> - breaks the forward compatibility feature: the older reader is not >>>> able to read the new version of the union without getting an error >>>> - breaks the backward compatibility feature: the new reader is not >>>> able to read an old version containing unknown types of the union >> without >>>> getting an error >>>> >>>> By the way, what do you exactly mean by "pushing evolution lower" and >>>> "update the record"? Could you please give me an example of the trick >> you >>>> are talking about? >>>> >>>> Just to be a bit more precise, we are not targeting to use a "trick". >> This >>>> life-cycle management should be included in a standard so to keep the >>>> software development clean, production safe and compliant with a complex >>>> product road-map. >>>> >>>> Finally, you seem to be concerned by the "significant change to the >>>> current evolution rules". Well, we actually do not change these rules, >> they >>>> keep just the same. All we are proposing is to introduce a *mode* where >>>> the rules of union compatibility change. This mode is materialized by a >>>> minimum and thin impact of the existing classes without any change in >> the >>>> behavior, all the logic of the new compatibility mode is implemented by >> new >>>> classes that must be invoked specifically. But you would better see it >> in >>>> the code patch. >>>> >>>> Looking forward to reading your feedback and answers. >>>> >>>> Regards, >>>> Yacine. >>>> >>>> >>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <rb...@netflix.com.invalid>: >>>> >>>>> Hi Yacine, >>>>> >>>>> Thanks for the proposal. It sounds interesting, but I want to make sure >>>>> there's a clear use case for this because it's a significant change to >> the >>>>> current evolution rules. Right now we guarantee that a reader will get >> an >>>>> error if the data has an unknown union branch rather than getting a >>>>> default >>>>> value. I think that makes sense: if the reader is requesting a field, >> it >>>>> should get the actual datum for it rather than a default because it >>>>> doesn't >>>>> know how to handle it. >>>>> >>>>> Could you give us an example use case that requires this new logic? >>>>> >>>>> I just want to make sure we can't solve your problem another way. For >>>>> example, pushing evolution lower in the schema usually does the trick: >>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1", >>>>> "RecordV2"], it is usually better to update the record so that older >>>>> readers can ignore the new fields. >>>>> >>>>> Thanks, >>>>> >>>>> rb >>>>> >>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane < >>>>> ybenabderrahm...@octo.com> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> In order to provide a solution to the union schema evolution problem, >>>>> as it >>>>>> was earlier notified in the thread "add a type to a union >>>>>> < >>>>>> >>>>> >> http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union >>>>>>> " >>>>>> of the user mailing list, we decided, for the needs of the reactive >>>>>> architecture we have implemented for one of our clients, to implement >> an >>>>>> evolution of the compatibility principle of Avro when using Unions. >> For >>>>>> reminder, the asked question was about the way to handle the case >> where >>>>> a >>>>>> reader, using an old version of a schema that includes a union, reads >>>>> some >>>>>> data written with a new version of the schema where a type has been >>>>> added >>>>>> to the union. >>>>>> >>>>>> As answered by Martin Kleppman in that thread, one way to handle this >>>>> kind >>>>>> of evolution (a new version of the schema adds a new type type in a >>>>> union) >>>>>> would be to ensure that all the development streams have integrated >> the >>>>> new >>>>>> schema B before deploying it in the IT schema referential. >>>>>> However, in big structures involving strongly uncorrelated teams (in >> the >>>>>> product life-cycle point of view), this approach appears to be quite >>>>>> impracticable, causing production stream congestion, blocking behavior >>>>>> between teams, and a bunch of other >>>>>> unwanted-counter-agile-/-reactive-phenomena... >>>>>> >>>>>> Therefore, we had to implement a new *compatibility* *mode* for the >>>>> unions, >>>>>> while taking care to comply with the following rules: >>>>>> >>>>>> 1. Clear rules of compatibility are stated and integrated for this >>>>>> compatibility mode >>>>>> 2. The standard Avro behavior must be kept intact >>>>>> 3. All the evolution implementation must be done without introducing >>>>> any >>>>>> regression (all existing tests of the Avro stack must succeed) >>>>>> 4. The code impact on Avro stack must be minimized >>>>>> >>>>>> Just to give you a very brief overview (as I don't know if this is >>>>> actually >>>>>> the place for a full detailed description), the evolution addresses >> the >>>>>> typical problem where two development streams use the same schema but >> in >>>>>> different versions, in the case described shortly as follows: >>>>>> >>>>>> - The first development stream, called "DevA", uses the version A of >>>>> a >>>>>> schema which integrates a union referencing two types, say "null" >> and >>>>>> "string". The default value is set to null. >>>>>> - The second development team, called "DevB", uses the version B, >>>>> which >>>>>> is an evolution of the version A, as it adds a reference to a new >>>>> type >>>>>> in >>>>>> the former union, say "long" (which makes it "null", string" and >>>>> "long") >>>>>> - When the schema B is deployed on the schema referential (in our >>>>> case, >>>>>> the IO Confluent Schema Registry) subsequently to the version A >>>>>> - The stream "DevA" must be able to read with schema A, even if >>>>> the >>>>>> data has been written using the schema B with the type "long" in >>>>>> the union. >>>>>> In the latter case, the read value is the union default value >>>>>> - The stream "DevB" must be able to read/write with schema B, >>>>> even if >>>>>> it writes the data using the type "long" in the union >>>>>> >>>>>> The evolution that we implemented for this mode includes some rules >> that >>>>>> are based on the principles stated in the Avro documentation. It is >> even >>>>>> more powerful than showed in the few lines above, as it enables the >>>>> readers >>>>>> to get the default value of the union if the schema used for reading >>>>> does >>>>>> not contain the type used by the writer in the union. This achieves a >>>>> new >>>>>> mode of forward / backward compatibility. This evolution is for now >>>>> working >>>>>> perfectly, and should be on production in the few coming weeks. We >> have >>>>>> also made an evolution of the IO Confluent Schema Registry stack to >>>>> support >>>>>> it, again in a transparent manner (we also intend to contribute to >> this >>>>>> stack in a second / parallel step). >>>>>> >>>>>> In the objective of contributing to the Avro stack with this new >>>>>> compatibility mode for unions, I have some questions about the >>>>> procedure: >>>>>> >>>>>> 1. How can I achieve the contribution proposal? Should I directly >>>>>> provide a patch in JIRA and dive into the details right there? >>>>>> 2. The base version of this evolution is 1.7.7, is it eligible to >>>>>> contribution evaluation anyway? >>>>>> >>>>>> Thanks in advance, looking forward to hearing from you and giving you >>>>> more >>>>>> details. >>>>>> >>>>>> Kind Regards, >>>>>> -- >>>>>> *Yacine Benabderrahmane* >>>>>> Architect >>>>>> *OCTO Technology* >>>>>> <http://www.octo.com> >>>>>> ----------------------------------------------- >>>>>> Tel : +33 6 10 88 25 98 >>>>>> 50 avenue des Champs Elysées >>>>>> 75008 PARIS >>>>>> www.octo.com >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Software Engineer >>>>> Netflix >>>>> >>>> >>>> >> >> > > > -- > Ryan Blue > Software Engineer > Netflix