Thanks, Zoltan! Good to know there's an issue open for the enum enhancement and that you already suggested the fix we came up with in this thread. Lets use AVRO-1340 to track this.
rb On Sun, Apr 17, 2016 at 4:49 AM, Zoltan Farkas <zolyfar...@yahoo.com.invalid > wrote: > +1 , it would resolve AVRO-1340 as well! > > --z > > > > > On Apr 16, 2016, at 10:26 PM, Ryan Blue <rb...@netflix.com.INVALID> > wrote: > > > > +1 > > > > On Sat, Apr 16, 2016 at 5:54 PM, Matthieu Monsch <mon...@alum.mit.edu> > > wrote: > > > >>> I think it may make sense for > >>> enums to add a special value for it. > >> > >> That would work great. To give slightly more flexibility to users, we > >> could even allow the reader’s schema to explicitly specify which symbol > to > >> use when it reads an unknown symbol. If not specified, resolution would > >> fail (consistent with the current behavior). > >> > >> For example (assuming the corresponding key is called “onUnknown”): > >> > >>> { > >>> “type”: “enum”, > >>> “name”: “Platform”, > >>> “symbols”: [“UNSUPPORTED”, “ANDROID”, “IOS”], > >>> “onUnknown”: “UNSUPPORTED" > >>> } > >> > >> This `enum` would then be able to resolve schemas with extra symbols > (read > >> as `UNSUPPORTED`). > >> > >> > >> > >>>> On Apr 16, 2016, at 1:36 PM, Ryan Blue <rb...@netflix.com.INVALID> > >>> wrote: > >>> > >>> Matthieu, how would that work with enums? I think it may make sense for > >>> enums to add a special value for it. > >>> > >>> rb > >>> > >>> On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch <mon...@alum.mit.edu> > >>> wrote: > >>> > >>>>> I think that substituting different data for > >>>>> unrecognized branches in a union isn't the way to fix the problem, > but > >> I > >>>>> have a proposal for a way to fix it that looks more like you'd expect > >> in > >>>>> the OO example above: by adding the Vehicle class to your read > schema. > >>>>> > >>>>> Right now, unions are first resolved by full class name as required > by > >>>> the > >>>>> spec. But after that, we have some additional rules to match schemas. > >>>> These > >>>>> rules are how we reconcile name differences from situations like > >> writing > >>>>> with a generic class and reading with a specific class. I'm proposing > >> you > >>>>> use a catch-all class (the superclass) with fields that are in all of > >> the > >>>>> union's branches, and we update schema resolution to allow it. > >>>> > >>>> > >>>> That sounds good. The only thing I’d add is to make this catch-all > >>>> behavior explicit in the schema (similar to how field defaults must be > >>>> explicitly added). > >>>> To help fix another common writer evolution issue, we could also add a > >>>> similar catch-all for `enum`s (optional, to be explicitly specified in > >> the > >>>> schema). > >>>> -Matthieu > >>>> > >>>> > >>>> > >>>>>> On Apr 12, 2016, at 2:41 PM, Ryan Blue <rb...@netflix.com.INVALID> > >>>>> wrote: > >>>>> > >>>>> Yacine, > >>>>> > >>>>> Thanks for the extra information. Sorry for my delay in replying, I > >>>> wanted > >>>>> to time think about your suggestion. > >>>>> > >>>>> I see what you mean that you can think of a union as the superclass > of > >>>> its > >>>>> options. The reflect object model has test code that does just that, > >>>> where > >>>>> classes B and C inherit from A and the schema for A is created as a > >> union > >>>>> of B and C. But, I don't think that your suggestion aligns with the > >>>>> expectations of object oriented design. Maybe that's an easier way to > >>>>> present my concern: > >>>>> > >>>>> Say I have a class, Vehicle, with subclasses Car and Truck. I have > >>>>> applications that work with my dataset, the vehicles that my company > >>>> owns, > >>>>> and we buy a bus. If I add a Bus class, what normally happens is that > >> an > >>>>> older application can work with it. A maintenance tracker would call > >>>>> getVehicles and can still get the lastMaintenanceDate for my Bus, > even > >>>>> though it doesn't know about busses. But what you suggest is that it > is > >>>>> replaced with a default, say null, in cases like this. > >>>>> > >>>>> I think that the problem is that Avro no equivalent concept of > >>>> inheritance. > >>>>> There's only one way to represent it for what you need right now, > like > >>>>> Matthieu suggested. I think that substituting different data for > >>>>> unrecognized branches in a union isn't the way to fix the problem, > but > >> I > >>>>> have a proposal for a way to fix it that looks more like you'd expect > >> in > >>>>> the OO example above: by adding the Vehicle class to your read > schema. > >>>>> > >>>>> Right now, unions are first resolved by full class name as required > by > >>>> the > >>>>> spec. But after that, we have some additional rules to match schemas. > >>>> These > >>>>> rules are how we reconcile name differences from situations like > >> writing > >>>>> with a generic class and reading with a specific class. I'm proposing > >> you > >>>>> use a catch-all class (the superclass) with fields that are in all of > >> the > >>>>> union's branches, and we update schema resolution to allow it. So > you'd > >>>>> have... > >>>>> > >>>>> record Vehicle { > >>>>> long lastMaintenanceDate; > >>>>> } > >>>>> > >>>>> record Car { > >>>>> long lastMaintenanceDate; > >>>>> int maxChildSeats; > >>>>> } > >>>>> > >>>>> record Truck { > >>>>> long lastMainetanceDate; > >>>>> long lastWashedDate; > >>>>> } > >>>>> > >>>>> Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. > The > >>>>> catch-all class could provide the behavior you want, without silently > >>>>> giving you incorrect data. > >>>>> > >>>>> Does that sound like a reasonable solution for your use case? > >>>>> > >>>>> rb > >>>>> > >>>>> On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch < > mon...@alum.mit.edu > >>> > >>>>> wrote: > >>>>> > >>>>>> Hi Yacine, > >>>>>> > >>>>>> I believe Ryan was mentioning that if you start from a schema > >> `[“null”, > >>>>>> “Car”]` then rather than add a new bus branch to the union, you > could > >>>>>> update the car’s schema to include new bus fields. For example > (using > >>>> IDL > >>>>>> notation): > >>>>>> > >>>>>>> // Initial approach > >>>>>>> // Evolved to union { null, Car, Bus } > >>>>>>> record Car { > >>>>>>> string vin; > >>>>>>> } > >>>>>>> record Bus { > >>>>>>> string vin; > >>>>>>> int capacity; > >>>>>>> } > >>>>>>> > >>>>>>> // Alternative approach > >>>>>>> // Here evolution wouldn't change the union { null, Vehicle } > >>>>>>> // Note also that this is actually a compatible evolution from the > >>>> first > >>>>>> approach, even under current rules (using aliases for name changes). > >>>>>>> record Vehicle { > >>>>>>> string vin; // (Would be optional if not all branches had it.) > >>>>>>> union { null, int } capacity; // The new field is added here. > >>>>>>> } > >>>>>> > >>>>>> Pushing this further, you could also directly start upfront with > just > >> a > >>>>>> “Vehicle” record schema with an optional `car` field and add new > >>>> optional > >>>>>> fields as necessary (for example a `bus` field). This gives you much > >>>>>> greater flexibility for evolving readers and writers separately (but > >> you > >>>>>> lose the ability to enforce that at most one field is ever present > in > >>>> the > >>>>>> record). > >>>>>> > >>>>>> I agree that Avro’s current evolution rules are more geared towards > >>>> reader > >>>>>> evolution: evolving writers while keeping readers constant is > >> difficult > >>>>>> (namely because they don't support adding branches to unions or > >> symbols > >>>> to > >>>>>> enums). However, adding a completely new and separate "compatibility > >>>> mode” > >>>>>> has a high complexity cost; both for implementors (the separate > >> classes > >>>> you > >>>>>> mention) and users (who must invoke them specifically). It would be > >>>> best to > >>>>>> keep evolution rules universal. > >>>>>> > >>>>>> Maybe we could extend the logic behind field defaults? Enum schemas > >>>> could > >>>>>> simply contain a default symbol attribute. For unions, it’s a bit > >>>> trickier, > >>>>>> but we should be able to get around with a default branch attribute > >>>> which > >>>>>> would allow evolution if and only if the newly added branches can be > >>>> read > >>>>>> as the default branch. Assuming the attribute names don’t already > >> exist, > >>>>>> this would be compatible with the current behavior. > >>>>>> > >>>>>> I think this use-case is common enough that it’d be worth exploring > >> (for > >>>>>> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address > >> this > >>>>>> issue; though not completely unfortunately [2]). > >>>>>> > >>>>>> -Matthieu > >>>>>> > >>>>>> [1] https://github.com/linkedin/rest.li > >>>>>> [2] > >> > https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible > >>>>>> > >>>>>> > >>>>>> > >>>>>>>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane < > >>>>>>> ybenabderrahm...@octo.com> wrote: > >>>>>>> > >>>>>>> Hi Ryan, > >>>>>>> > >>>>>>> Just a little up^^. Could you please (or anyone else) give me a > >> little > >>>>>>> feedback ? > >>>>>>> Thanks in advance. > >>>>>>> > >>>>>>> Regards, > >>>>>>> Yacine. > >>>>>>> > >>>>>>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane < > >>>>>> ybenabderrahm...@octo.com > >>>>>>>> : > >>>>>>> > >>>>>>>> Hi Ryan, > >>>>>>>> > >>>>>>>> Thank you for giving feedback. I will try in the following to > >> provide > >>>>>> you > >>>>>>>> some more details about the addressed problem. > >>>>>>>> > >>>>>>>> But before that, just a brief reminder of the context. Avro has > been > >>>>>>>> chosen in this project (and by many other ones for sure) > especially > >>>> for > >>>>>> a > >>>>>>>> very important feature: enabling forward and backward > compatibility > >>>>>>>> management through schema life-cycle. Our development model > involves > >>>>>>>> intensive usage of this feature, and many heavy developments are > >> made > >>>> in > >>>>>>>> parallel streams inside feature teams that share the same schema, > >>>>>> provided > >>>>>>>> the evolution of the latter complies with the stated compatibility > >>>>>> rules. > >>>>>>>> This implies that all the entities supported by the Avro schema > must > >>>>>>>> support the two-way compatibility, including unions. However, in > the > >>>>>>>> special case of the union, this two-way compatibility is not well > >>>>>> supported > >>>>>>>> by the current rules. Let me explain you the basement of our point > >> of > >>>>>> view, > >>>>>>>> it remains quite simple. > >>>>>>>> > >>>>>>>> The use case is to have, for example, > >>>>>>>> - a first union version A: > >>>>>>>> { "name": "Vehicle", > >>>>>>>> "type": ["null", "Car"] } > >>>>>>>> - a new version of it B: > >>>>>>>> { "name": "Vehicle", > >>>>>>>> "type": ["null", "Car", "Bus"] } > >>>>>>>> For being forward compatible, an evolution of the union schema > must > >>>>>>>> guarantee that an old reader reading with A can read the data > >> written > >>>>>> with > >>>>>>>> the new schema B. Getting an error just means that the forward > >>>>>>>> compatibility feature is broken. But this is not actually the case > >>>> (and > >>>>>>>> this behavior is not suitable), because the old reader has a > correct > >>>>>> schema > >>>>>>>> and this schema has evolved naturally to version B to incorporate > a > >>>> new > >>>>>>>> Vehicle type. Not knowing this new type must not produce an error, > >> but > >>>>>> just > >>>>>>>> give the reader a default value, which means: "Either the data is > >> not > >>>>>> there > >>>>>>>> or you do not know how to handle it". > >>>>>>>> > >>>>>>>> This is thought while keeping in mind that in an object-oriented > >> code > >>>>>>>> modeling, a union field is seen as a class member with the higher > >>>> level > >>>>>>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore, > it > >> is > >>>>>>>> natural for a modeler / programmer to implement the ability of not > >>>>>> getting > >>>>>>>> the awaited types and using some default value of known type. To > >> give > >>>> a > >>>>>>>> more complete specification, the new mode of compatibility has to > >>>> impose > >>>>>>>> one rule: the union default value must not change through versions > >> and > >>>>>> the > >>>>>>>> corresponding type must be placed at the top of the types list. > This > >>>> is > >>>>>>>> much easier to handle by development streams, because it is > >> addressed > >>>>>> once > >>>>>>>> for all in the very beginning of the schema life-cycle, than the > >> fact > >>>> to > >>>>>>>> oblige a number of teams, among which some are just not in place > >>>>>> anymore, > >>>>>>>> to update the whole code just because another dev team has > deployed > >> a > >>>>>> new > >>>>>>>> version of the union in the schema. > >>>>>>>> > >>>>>>>> Now, for being backward compatible, the reader with B must always > be > >>>>>> able > >>>>>>>> to read data written with schema A. Even if the type included in > the > >>>>>> data > >>>>>>>> is not known, so it gets the default value and not an error. > >>>>>>>> > >>>>>>>> I understand that getting an error could make sense when the > >> requested > >>>>>>>> field is not present. However, this behavior: > >>>>>>>> > >>>>>>>> - is very restrictive, meaning: this obliges the old reader to > >> update > >>>>>>>> its code for integrating the new schema, while he is not managing > to > >>>>>> do it > >>>>>>>> for many reasons: development stream of next delivery is not > >>>>>> finished, or > >>>>>>>> not engaged, or not even planned - in the case of old and stable > >> code > >>>>>>>> - breaks the forward compatibility feature: the older reader is > not > >>>>>>>> able to read the new version of the union without getting an error > >>>>>>>> - breaks the backward compatibility feature: the new reader is not > >>>>>>>> able to read an old version containing unknown types of the union > >>>>>> without > >>>>>>>> getting an error > >>>>>>>> > >>>>>>>> By the way, what do you exactly mean by "pushing evolution lower" > >> and > >>>>>>>> "update the record"? Could you please give me an example of the > >> trick > >>>>>> you > >>>>>>>> are talking about? > >>>>>>>> > >>>>>>>> Just to be a bit more precise, we are not targeting to use a > >> "trick". > >>>>>> This > >>>>>>>> life-cycle management should be included in a standard so to keep > >> the > >>>>>>>> software development clean, production safe and compliant with a > >>>> complex > >>>>>>>> product road-map. > >>>>>>>> > >>>>>>>> Finally, you seem to be concerned by the "significant change to > the > >>>>>>>> current evolution rules". Well, we actually do not change these > >> rules, > >>>>>> they > >>>>>>>> keep just the same. All we are proposing is to introduce a *mode* > >>>> where > >>>>>>>> the rules of union compatibility change. This mode is materialized > >> by > >>>> a > >>>>>>>> minimum and thin impact of the existing classes without any change > >> in > >>>>>> the > >>>>>>>> behavior, all the logic of the new compatibility mode is > implemented > >>>> by > >>>>>> new > >>>>>>>> classes that must be invoked specifically. But you would better > see > >> it > >>>>>> in > >>>>>>>> the code patch. > >>>>>>>> > >>>>>>>> Looking forward to reading your feedback and answers. > >>>>>>>> > >>>>>>>> Regards, > >>>>>>>> Yacine. > >>>>>>>> > >>>>>>>> > >>>>>>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <rb...@netflix.com.invalid>: > >>>>>>>> > >>>>>>>>> Hi Yacine, > >>>>>>>>> > >>>>>>>>> Thanks for the proposal. It sounds interesting, but I want to > make > >>>> sure > >>>>>>>>> there's a clear use case for this because it's a significant > change > >>>> to > >>>>>> the > >>>>>>>>> current evolution rules. Right now we guarantee that a reader > will > >>>> get > >>>>>> an > >>>>>>>>> error if the data has an unknown union branch rather than > getting a > >>>>>>>>> default > >>>>>>>>> value. I think that makes sense: if the reader is requesting a > >> field, > >>>>>> it > >>>>>>>>> should get the actual datum for it rather than a default because > it > >>>>>>>>> doesn't > >>>>>>>>> know how to handle it. > >>>>>>>>> > >>>>>>>>> Could you give us an example use case that requires this new > logic? > >>>>>>>>> > >>>>>>>>> I just want to make sure we can't solve your problem another way. > >> For > >>>>>>>>> example, pushing evolution lower in the schema usually does the > >>>> trick: > >>>>>>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1", > >>>>>>>>> "RecordV2"], it is usually better to update the record so that > >> older > >>>>>>>>> readers can ignore the new fields. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> > >>>>>>>>> rb > >>>>>>>>> > >>>>>>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane < > >>>>>>>>> ybenabderrahm...@octo.com> wrote: > >>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> In order to provide a solution to the union schema evolution > >>>> problem, > >>>>>>>>> as it > >>>>>>>>>> was earlier notified in the thread "add a type to a union > >>>>>>>>>> < > >> > http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union > >>>>>>>>>>> " > >>>>>>>>>> of the user mailing list, we decided, for the needs of the > >> reactive > >>>>>>>>>> architecture we have implemented for one of our clients, to > >>>> implement > >>>>>> an > >>>>>>>>>> evolution of the compatibility principle of Avro when using > >> Unions. > >>>>>> For > >>>>>>>>>> reminder, the asked question was about the way to handle the > case > >>>>>> where > >>>>>>>>> a > >>>>>>>>>> reader, using an old version of a schema that includes a union, > >>>> reads > >>>>>>>>> some > >>>>>>>>>> data written with a new version of the schema where a type has > >> been > >>>>>>>>> added > >>>>>>>>>> to the union. > >>>>>>>>>> > >>>>>>>>>> As answered by Martin Kleppman in that thread, one way to handle > >>>> this > >>>>>>>>> kind > >>>>>>>>>> of evolution (a new version of the schema adds a new type type > in > >> a > >>>>>>>>> union) > >>>>>>>>>> would be to ensure that all the development streams have > >> integrated > >>>>>> the > >>>>>>>>> new > >>>>>>>>>> schema B before deploying it in the IT schema referential. > >>>>>>>>>> However, in big structures involving strongly uncorrelated teams > >> (in > >>>>>> the > >>>>>>>>>> product life-cycle point of view), this approach appears to be > >> quite > >>>>>>>>>> impracticable, causing production stream congestion, blocking > >>>> behavior > >>>>>>>>>> between teams, and a bunch of other > >>>>>>>>>> unwanted-counter-agile-/-reactive-phenomena... > >>>>>>>>>> > >>>>>>>>>> Therefore, we had to implement a new *compatibility* *mode* for > >> the > >>>>>>>>> unions, > >>>>>>>>>> while taking care to comply with the following rules: > >>>>>>>>>> > >>>>>>>>>> 1. Clear rules of compatibility are stated and integrated for > this > >>>>>>>>>> compatibility mode > >>>>>>>>>> 2. The standard Avro behavior must be kept intact > >>>>>>>>>> 3. All the evolution implementation must be done without > >>>> introducing > >>>>>>>>> any > >>>>>>>>>> regression (all existing tests of the Avro stack must succeed) > >>>>>>>>>> 4. The code impact on Avro stack must be minimized > >>>>>>>>>> > >>>>>>>>>> Just to give you a very brief overview (as I don't know if this > is > >>>>>>>>> actually > >>>>>>>>>> the place for a full detailed description), the evolution > >> addresses > >>>>>> the > >>>>>>>>>> typical problem where two development streams use the same > schema > >>>> but > >>>>>> in > >>>>>>>>>> different versions, in the case described shortly as follows: > >>>>>>>>>> > >>>>>>>>>> - The first development stream, called "DevA", uses the version > A > >>>> of > >>>>>>>>> a > >>>>>>>>>> schema which integrates a union referencing two types, say > "null" > >>>>>> and > >>>>>>>>>> "string". The default value is set to null. > >>>>>>>>>> - The second development team, called "DevB", uses the version > B, > >>>>>>>>> which > >>>>>>>>>> is an evolution of the version A, as it adds a reference to a > new > >>>>>>>>> type > >>>>>>>>>> in > >>>>>>>>>> the former union, say "long" (which makes it "null", string" and > >>>>>>>>> "long") > >>>>>>>>>> - When the schema B is deployed on the schema referential (in > our > >>>>>>>>> case, > >>>>>>>>>> the IO Confluent Schema Registry) subsequently to the version A > >>>>>>>>>> - The stream "DevA" must be able to read with schema A, even > if > >>>>>>>>> the > >>>>>>>>>> data has been written using the schema B with the type "long" > >> in > >>>>>>>>>> the union. > >>>>>>>>>> In the latter case, the read value is the union default value > >>>>>>>>>> - The stream "DevB" must be able to read/write with schema B, > >>>>>>>>> even if > >>>>>>>>>> it writes the data using the type "long" in the union > >>>>>>>>>> > >>>>>>>>>> The evolution that we implemented for this mode includes some > >> rules > >>>>>> that > >>>>>>>>>> are based on the principles stated in the Avro documentation. It > >> is > >>>>>> even > >>>>>>>>>> more powerful than showed in the few lines above, as it enables > >> the > >>>>>>>>> readers > >>>>>>>>>> to get the default value of the union if the schema used for > >> reading > >>>>>>>>> does > >>>>>>>>>> not contain the type used by the writer in the union. This > >> achieves > >>>> a > >>>>>>>>> new > >>>>>>>>>> mode of forward / backward compatibility. This evolution is for > >> now > >>>>>>>>> working > >>>>>>>>>> perfectly, and should be on production in the few coming weeks. > We > >>>>>> have > >>>>>>>>>> also made an evolution of the IO Confluent Schema Registry stack > >> to > >>>>>>>>> support > >>>>>>>>>> it, again in a transparent manner (we also intend to contribute > to > >>>>>> this > >>>>>>>>>> stack in a second / parallel step). > >>>>>>>>>> > >>>>>>>>>> In the objective of contributing to the Avro stack with this new > >>>>>>>>>> compatibility mode for unions, I have some questions about the > >>>>>>>>> procedure: > >>>>>>>>>> > >>>>>>>>>> 1. How can I achieve the contribution proposal? Should I > directly > >>>>>>>>>> provide a patch in JIRA and dive into the details right there? > >>>>>>>>>> 2. The base version of this evolution is 1.7.7, is it eligible > to > >>>>>>>>>> contribution evaluation anyway? > >>>>>>>>>> > >>>>>>>>>> Thanks in advance, looking forward to hearing from you and > giving > >>>> you > >>>>>>>>> more > >>>>>>>>>> details. > >>>>>>>>>> > >>>>>>>>>> Kind Regards, > >>>>>>>>>> -- > >>>>>>>>>> *Yacine Benabderrahmane* > >>>>>>>>>> Architect > >>>>>>>>>> *OCTO Technology* > >>>>>>>>>> <http://www.octo.com> > >>>>>>>>>> ----------------------------------------------- > >>>>>>>>>> Tel : +33 6 10 88 25 98 > >>>>>>>>>> 50 avenue des Champs Elysées > >>>>>>>>>> 75008 PARIS > >>>>>>>>>> www.octo.com > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> -- > >>>>>>>>> Ryan Blue > >>>>>>>>> Software Engineer > >>>>>>>>> Netflix > >>>>> > >>>>> > >>>>> -- > >>>>> Ryan Blue > >>>>> Software Engineer > >>>>> Netflix > >>> > >>> > >>> -- > >>> Ryan Blue > >>> Software Engineer > >>> Netflix > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > -- Ryan Blue Software Engineer Netflix