+1 On Sat, Apr 16, 2016 at 5:54 PM, Matthieu Monsch <mon...@alum.mit.edu> wrote:
> > I think it may make sense for > > enums to add a special value for it. > > That would work great. To give slightly more flexibility to users, we > could even allow the reader’s schema to explicitly specify which symbol to > use when it reads an unknown symbol. If not specified, resolution would > fail (consistent with the current behavior). > > For example (assuming the corresponding key is called “onUnknown”): > > > { > > “type”: “enum”, > > “name”: “Platform”, > > “symbols”: [“UNSUPPORTED”, “ANDROID”, “IOS”], > > “onUnknown”: “UNSUPPORTED" > > } > > This `enum` would then be able to resolve schemas with extra symbols (read > as `UNSUPPORTED`). > > > > > On Apr 16, 2016, at 1:36 PM, Ryan Blue <rb...@netflix.com.INVALID> > wrote: > > > > Matthieu, how would that work with enums? I think it may make sense for > > enums to add a special value for it. > > > > rb > > > > On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch <mon...@alum.mit.edu> > > wrote: > > > >>> I think that substituting different data for > >>> unrecognized branches in a union isn't the way to fix the problem, but > I > >>> have a proposal for a way to fix it that looks more like you'd expect > in > >>> the OO example above: by adding the Vehicle class to your read schema. > >>> > >>> Right now, unions are first resolved by full class name as required by > >> the > >>> spec. But after that, we have some additional rules to match schemas. > >> These > >>> rules are how we reconcile name differences from situations like > writing > >>> with a generic class and reading with a specific class. I'm proposing > you > >>> use a catch-all class (the superclass) with fields that are in all of > the > >>> union's branches, and we update schema resolution to allow it. > >> > >> > >> That sounds good. The only thing I’d add is to make this catch-all > >> behavior explicit in the schema (similar to how field defaults must be > >> explicitly added). > >> To help fix another common writer evolution issue, we could also add a > >> similar catch-all for `enum`s (optional, to be explicitly specified in > the > >> schema). > >> -Matthieu > >> > >> > >> > >>> On Apr 12, 2016, at 2:41 PM, Ryan Blue <rb...@netflix.com.INVALID> > >> wrote: > >>> > >>> Yacine, > >>> > >>> Thanks for the extra information. Sorry for my delay in replying, I > >> wanted > >>> to time think about your suggestion. > >>> > >>> I see what you mean that you can think of a union as the superclass of > >> its > >>> options. The reflect object model has test code that does just that, > >> where > >>> classes B and C inherit from A and the schema for A is created as a > union > >>> of B and C. But, I don't think that your suggestion aligns with the > >>> expectations of object oriented design. Maybe that's an easier way to > >>> present my concern: > >>> > >>> Say I have a class, Vehicle, with subclasses Car and Truck. I have > >>> applications that work with my dataset, the vehicles that my company > >> owns, > >>> and we buy a bus. If I add a Bus class, what normally happens is that > an > >>> older application can work with it. A maintenance tracker would call > >>> getVehicles and can still get the lastMaintenanceDate for my Bus, even > >>> though it doesn't know about busses. But what you suggest is that it is > >>> replaced with a default, say null, in cases like this. > >>> > >>> I think that the problem is that Avro no equivalent concept of > >> inheritance. > >>> There's only one way to represent it for what you need right now, like > >>> Matthieu suggested. I think that substituting different data for > >>> unrecognized branches in a union isn't the way to fix the problem, but > I > >>> have a proposal for a way to fix it that looks more like you'd expect > in > >>> the OO example above: by adding the Vehicle class to your read schema. > >>> > >>> Right now, unions are first resolved by full class name as required by > >> the > >>> spec. But after that, we have some additional rules to match schemas. > >> These > >>> rules are how we reconcile name differences from situations like > writing > >>> with a generic class and reading with a specific class. I'm proposing > you > >>> use a catch-all class (the superclass) with fields that are in all of > the > >>> union's branches, and we update schema resolution to allow it. So you'd > >>> have... > >>> > >>> record Vehicle { > >>> long lastMaintenanceDate; > >>> } > >>> > >>> record Car { > >>> long lastMaintenanceDate; > >>> int maxChildSeats; > >>> } > >>> > >>> record Truck { > >>> long lastMainetanceDate; > >>> long lastWashedDate; > >>> } > >>> > >>> Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The > >>> catch-all class could provide the behavior you want, without silently > >>> giving you incorrect data. > >>> > >>> Does that sound like a reasonable solution for your use case? > >>> > >>> rb > >>> > >>> On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <mon...@alum.mit.edu > > > >>> wrote: > >>> > >>>> Hi Yacine, > >>>> > >>>> I believe Ryan was mentioning that if you start from a schema > `[“null”, > >>>> “Car”]` then rather than add a new bus branch to the union, you could > >>>> update the car’s schema to include new bus fields. For example (using > >> IDL > >>>> notation): > >>>> > >>>>> // Initial approach > >>>>> // Evolved to union { null, Car, Bus } > >>>>> record Car { > >>>>> string vin; > >>>>> } > >>>>> record Bus { > >>>>> string vin; > >>>>> int capacity; > >>>>> } > >>>>> > >>>>> // Alternative approach > >>>>> // Here evolution wouldn't change the union { null, Vehicle } > >>>>> // Note also that this is actually a compatible evolution from the > >> first > >>>> approach, even under current rules (using aliases for name changes). > >>>>> record Vehicle { > >>>>> string vin; // (Would be optional if not all branches had it.) > >>>>> union { null, int } capacity; // The new field is added here. > >>>>> } > >>>> > >>>> Pushing this further, you could also directly start upfront with just > a > >>>> “Vehicle” record schema with an optional `car` field and add new > >> optional > >>>> fields as necessary (for example a `bus` field). This gives you much > >>>> greater flexibility for evolving readers and writers separately (but > you > >>>> lose the ability to enforce that at most one field is ever present in > >> the > >>>> record). > >>>> > >>>> I agree that Avro’s current evolution rules are more geared towards > >> reader > >>>> evolution: evolving writers while keeping readers constant is > difficult > >>>> (namely because they don't support adding branches to unions or > symbols > >> to > >>>> enums). However, adding a completely new and separate "compatibility > >> mode” > >>>> has a high complexity cost; both for implementors (the separate > classes > >> you > >>>> mention) and users (who must invoke them specifically). It would be > >> best to > >>>> keep evolution rules universal. > >>>> > >>>> Maybe we could extend the logic behind field defaults? Enum schemas > >> could > >>>> simply contain a default symbol attribute. For unions, it’s a bit > >> trickier, > >>>> but we should be able to get around with a default branch attribute > >> which > >>>> would allow evolution if and only if the newly added branches can be > >> read > >>>> as the default branch. Assuming the attribute names don’t already > exist, > >>>> this would be compatible with the current behavior. > >>>> > >>>> I think this use-case is common enough that it’d be worth exploring > (for > >>>> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address > this > >>>> issue; though not completely unfortunately [2]). > >>>> > >>>> -Matthieu > >>>> > >>>> [1] https://github.com/linkedin/rest.li > >>>> [2] > >>>> > >> > https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible > >>>> > >>>> > >>>> > >>>>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane < > >>>> ybenabderrahm...@octo.com> wrote: > >>>>> > >>>>> Hi Ryan, > >>>>> > >>>>> Just a little up^^. Could you please (or anyone else) give me a > little > >>>>> feedback ? > >>>>> Thanks in advance. > >>>>> > >>>>> Regards, > >>>>> Yacine. > >>>>> > >>>>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane < > >>>> ybenabderrahm...@octo.com > >>>>>> : > >>>>> > >>>>>> Hi Ryan, > >>>>>> > >>>>>> Thank you for giving feedback. I will try in the following to > provide > >>>> you > >>>>>> some more details about the addressed problem. > >>>>>> > >>>>>> But before that, just a brief reminder of the context. Avro has been > >>>>>> chosen in this project (and by many other ones for sure) especially > >> for > >>>> a > >>>>>> very important feature: enabling forward and backward compatibility > >>>>>> management through schema life-cycle. Our development model involves > >>>>>> intensive usage of this feature, and many heavy developments are > made > >> in > >>>>>> parallel streams inside feature teams that share the same schema, > >>>> provided > >>>>>> the evolution of the latter complies with the stated compatibility > >>>> rules. > >>>>>> This implies that all the entities supported by the Avro schema must > >>>>>> support the two-way compatibility, including unions. However, in the > >>>>>> special case of the union, this two-way compatibility is not well > >>>> supported > >>>>>> by the current rules. Let me explain you the basement of our point > of > >>>> view, > >>>>>> it remains quite simple. > >>>>>> > >>>>>> The use case is to have, for example, > >>>>>> - a first union version A: > >>>>>> { "name": "Vehicle", > >>>>>> "type": ["null", "Car"] } > >>>>>> - a new version of it B: > >>>>>> { "name": "Vehicle", > >>>>>> "type": ["null", "Car", "Bus"] } > >>>>>> For being forward compatible, an evolution of the union schema must > >>>>>> guarantee that an old reader reading with A can read the data > written > >>>> with > >>>>>> the new schema B. Getting an error just means that the forward > >>>>>> compatibility feature is broken. But this is not actually the case > >> (and > >>>>>> this behavior is not suitable), because the old reader has a correct > >>>> schema > >>>>>> and this schema has evolved naturally to version B to incorporate a > >> new > >>>>>> Vehicle type. Not knowing this new type must not produce an error, > but > >>>> just > >>>>>> give the reader a default value, which means: "Either the data is > not > >>>> there > >>>>>> or you do not know how to handle it". > >>>>>> > >>>>>> This is thought while keeping in mind that in an object-oriented > code > >>>>>> modeling, a union field is seen as a class member with the higher > >> level > >>>>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore, it > is > >>>>>> natural for a modeler / programmer to implement the ability of not > >>>> getting > >>>>>> the awaited types and using some default value of known type. To > give > >> a > >>>>>> more complete specification, the new mode of compatibility has to > >> impose > >>>>>> one rule: the union default value must not change through versions > and > >>>> the > >>>>>> corresponding type must be placed at the top of the types list. This > >> is > >>>>>> much easier to handle by development streams, because it is > addressed > >>>> once > >>>>>> for all in the very beginning of the schema life-cycle, than the > fact > >> to > >>>>>> oblige a number of teams, among which some are just not in place > >>>> anymore, > >>>>>> to update the whole code just because another dev team has deployed > a > >>>> new > >>>>>> version of the union in the schema. > >>>>>> > >>>>>> Now, for being backward compatible, the reader with B must always be > >>>> able > >>>>>> to read data written with schema A. Even if the type included in the > >>>> data > >>>>>> is not known, so it gets the default value and not an error. > >>>>>> > >>>>>> I understand that getting an error could make sense when the > requested > >>>>>> field is not present. However, this behavior: > >>>>>> > >>>>>> - is very restrictive, meaning: this obliges the old reader to > update > >>>>>> its code for integrating the new schema, while he is not managing to > >>>> do it > >>>>>> for many reasons: development stream of next delivery is not > >>>> finished, or > >>>>>> not engaged, or not even planned - in the case of old and stable > code > >>>>>> - breaks the forward compatibility feature: the older reader is not > >>>>>> able to read the new version of the union without getting an error > >>>>>> - breaks the backward compatibility feature: the new reader is not > >>>>>> able to read an old version containing unknown types of the union > >>>> without > >>>>>> getting an error > >>>>>> > >>>>>> By the way, what do you exactly mean by "pushing evolution lower" > and > >>>>>> "update the record"? Could you please give me an example of the > trick > >>>> you > >>>>>> are talking about? > >>>>>> > >>>>>> Just to be a bit more precise, we are not targeting to use a > "trick". > >>>> This > >>>>>> life-cycle management should be included in a standard so to keep > the > >>>>>> software development clean, production safe and compliant with a > >> complex > >>>>>> product road-map. > >>>>>> > >>>>>> Finally, you seem to be concerned by the "significant change to the > >>>>>> current evolution rules". Well, we actually do not change these > rules, > >>>> they > >>>>>> keep just the same. All we are proposing is to introduce a *mode* > >> where > >>>>>> the rules of union compatibility change. This mode is materialized > by > >> a > >>>>>> minimum and thin impact of the existing classes without any change > in > >>>> the > >>>>>> behavior, all the logic of the new compatibility mode is implemented > >> by > >>>> new > >>>>>> classes that must be invoked specifically. But you would better see > it > >>>> in > >>>>>> the code patch. > >>>>>> > >>>>>> Looking forward to reading your feedback and answers. > >>>>>> > >>>>>> Regards, > >>>>>> Yacine. > >>>>>> > >>>>>> > >>>>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <rb...@netflix.com.invalid>: > >>>>>> > >>>>>>> Hi Yacine, > >>>>>>> > >>>>>>> Thanks for the proposal. It sounds interesting, but I want to make > >> sure > >>>>>>> there's a clear use case for this because it's a significant change > >> to > >>>> the > >>>>>>> current evolution rules. Right now we guarantee that a reader will > >> get > >>>> an > >>>>>>> error if the data has an unknown union branch rather than getting a > >>>>>>> default > >>>>>>> value. I think that makes sense: if the reader is requesting a > field, > >>>> it > >>>>>>> should get the actual datum for it rather than a default because it > >>>>>>> doesn't > >>>>>>> know how to handle it. > >>>>>>> > >>>>>>> Could you give us an example use case that requires this new logic? > >>>>>>> > >>>>>>> I just want to make sure we can't solve your problem another way. > For > >>>>>>> example, pushing evolution lower in the schema usually does the > >> trick: > >>>>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1", > >>>>>>> "RecordV2"], it is usually better to update the record so that > older > >>>>>>> readers can ignore the new fields. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> > >>>>>>> rb > >>>>>>> > >>>>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane < > >>>>>>> ybenabderrahm...@octo.com> wrote: > >>>>>>> > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> In order to provide a solution to the union schema evolution > >> problem, > >>>>>>> as it > >>>>>>>> was earlier notified in the thread "add a type to a union > >>>>>>>> < > >>>>>>>> > >>>>>>> > >>>> > >> > http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union > >>>>>>>>> " > >>>>>>>> of the user mailing list, we decided, for the needs of the > reactive > >>>>>>>> architecture we have implemented for one of our clients, to > >> implement > >>>> an > >>>>>>>> evolution of the compatibility principle of Avro when using > Unions. > >>>> For > >>>>>>>> reminder, the asked question was about the way to handle the case > >>>> where > >>>>>>> a > >>>>>>>> reader, using an old version of a schema that includes a union, > >> reads > >>>>>>> some > >>>>>>>> data written with a new version of the schema where a type has > been > >>>>>>> added > >>>>>>>> to the union. > >>>>>>>> > >>>>>>>> As answered by Martin Kleppman in that thread, one way to handle > >> this > >>>>>>> kind > >>>>>>>> of evolution (a new version of the schema adds a new type type in > a > >>>>>>> union) > >>>>>>>> would be to ensure that all the development streams have > integrated > >>>> the > >>>>>>> new > >>>>>>>> schema B before deploying it in the IT schema referential. > >>>>>>>> However, in big structures involving strongly uncorrelated teams > (in > >>>> the > >>>>>>>> product life-cycle point of view), this approach appears to be > quite > >>>>>>>> impracticable, causing production stream congestion, blocking > >> behavior > >>>>>>>> between teams, and a bunch of other > >>>>>>>> unwanted-counter-agile-/-reactive-phenomena... > >>>>>>>> > >>>>>>>> Therefore, we had to implement a new *compatibility* *mode* for > the > >>>>>>> unions, > >>>>>>>> while taking care to comply with the following rules: > >>>>>>>> > >>>>>>>> 1. Clear rules of compatibility are stated and integrated for this > >>>>>>>> compatibility mode > >>>>>>>> 2. The standard Avro behavior must be kept intact > >>>>>>>> 3. All the evolution implementation must be done without > >> introducing > >>>>>>> any > >>>>>>>> regression (all existing tests of the Avro stack must succeed) > >>>>>>>> 4. The code impact on Avro stack must be minimized > >>>>>>>> > >>>>>>>> Just to give you a very brief overview (as I don't know if this is > >>>>>>> actually > >>>>>>>> the place for a full detailed description), the evolution > addresses > >>>> the > >>>>>>>> typical problem where two development streams use the same schema > >> but > >>>> in > >>>>>>>> different versions, in the case described shortly as follows: > >>>>>>>> > >>>>>>>> - The first development stream, called "DevA", uses the version A > >> of > >>>>>>> a > >>>>>>>> schema which integrates a union referencing two types, say "null" > >>>> and > >>>>>>>> "string". The default value is set to null. > >>>>>>>> - The second development team, called "DevB", uses the version B, > >>>>>>> which > >>>>>>>> is an evolution of the version A, as it adds a reference to a new > >>>>>>> type > >>>>>>>> in > >>>>>>>> the former union, say "long" (which makes it "null", string" and > >>>>>>> "long") > >>>>>>>> - When the schema B is deployed on the schema referential (in our > >>>>>>> case, > >>>>>>>> the IO Confluent Schema Registry) subsequently to the version A > >>>>>>>> - The stream "DevA" must be able to read with schema A, even if > >>>>>>> the > >>>>>>>> data has been written using the schema B with the type "long" > in > >>>>>>>> the union. > >>>>>>>> In the latter case, the read value is the union default value > >>>>>>>> - The stream "DevB" must be able to read/write with schema B, > >>>>>>> even if > >>>>>>>> it writes the data using the type "long" in the union > >>>>>>>> > >>>>>>>> The evolution that we implemented for this mode includes some > rules > >>>> that > >>>>>>>> are based on the principles stated in the Avro documentation. It > is > >>>> even > >>>>>>>> more powerful than showed in the few lines above, as it enables > the > >>>>>>> readers > >>>>>>>> to get the default value of the union if the schema used for > reading > >>>>>>> does > >>>>>>>> not contain the type used by the writer in the union. This > achieves > >> a > >>>>>>> new > >>>>>>>> mode of forward / backward compatibility. This evolution is for > now > >>>>>>> working > >>>>>>>> perfectly, and should be on production in the few coming weeks. We > >>>> have > >>>>>>>> also made an evolution of the IO Confluent Schema Registry stack > to > >>>>>>> support > >>>>>>>> it, again in a transparent manner (we also intend to contribute to > >>>> this > >>>>>>>> stack in a second / parallel step). > >>>>>>>> > >>>>>>>> In the objective of contributing to the Avro stack with this new > >>>>>>>> compatibility mode for unions, I have some questions about the > >>>>>>> procedure: > >>>>>>>> > >>>>>>>> 1. How can I achieve the contribution proposal? Should I directly > >>>>>>>> provide a patch in JIRA and dive into the details right there? > >>>>>>>> 2. The base version of this evolution is 1.7.7, is it eligible to > >>>>>>>> contribution evaluation anyway? > >>>>>>>> > >>>>>>>> Thanks in advance, looking forward to hearing from you and giving > >> you > >>>>>>> more > >>>>>>>> details. > >>>>>>>> > >>>>>>>> Kind Regards, > >>>>>>>> -- > >>>>>>>> *Yacine Benabderrahmane* > >>>>>>>> Architect > >>>>>>>> *OCTO Technology* > >>>>>>>> <http://www.octo.com> > >>>>>>>> ----------------------------------------------- > >>>>>>>> Tel : +33 6 10 88 25 98 > >>>>>>>> 50 avenue des Champs Elysées > >>>>>>>> 75008 PARIS > >>>>>>>> www.octo.com > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Ryan Blue > >>>>>>> Software Engineer > >>>>>>> Netflix > >>>>>>> > >>>>>> > >>>>>> > >>>> > >>>> > >>> > >>> > >>> -- > >>> Ryan Blue > >>> Software Engineer > >>> Netflix > >> > >> > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > -- Ryan Blue Software Engineer Netflix