Hi, I was talking at the Avro level: https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas

Nonetheless, this deserves more holistic thinking, so looking forward to the RFC.

Thanks,
Vinoth
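(For reference, the "Parsing Canonical Form" section linked above is what makes cosmetically different schemas compare equal: attributes such as doc are stripped before fingerprinting. A minimal sketch, with an invented class name and schema literals, assuming Avro 1.8+:)

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class CanonicalFormSketch {
  public static void main(String[] args) {
    // The same logical record declared twice; the second adds doc attributes,
    // which the parsing canonical form strips out.
    Schema a = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");
    Schema b = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"doc\":\"a user\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\",\"doc\":\"pk\"}]}");

    // Both reduce to the same canonical form, hence the same fingerprint.
    System.out.println(SchemaNormalization.toParsingForm(a));
    System.out.println(SchemaNormalization.toParsingForm(b));
    System.out.println(SchemaNormalization.parsingFingerprint64(a)
        == SchemaNormalization.parsingFingerprint64(b)); // true
  }
}

Schemas that differ only in such attributes produce the same canonical form and the same CRC-64-AVRO fingerprint.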
On Fri, Feb 7, 2020 at 1:24 AM Pratyaksh Sharma <[email protected]> wrote:

@Nishith

> Hudi relies on Avro schema evolution rules which helps to prevent breaking of existing queries on such tables

I want to understand this statement from the code's perspective. As far as I know, in the HoodieAvroUtils class we validate the rewritten record against the Avro schema as shown below:

private static GenericRecord rewrite(GenericRecord record, Schema schemaWithFields, Schema newSchema) {
  GenericRecord newRecord = new GenericData.Record(newSchema);
  for (Schema.Field f : schemaWithFields.getFields()) {
    newRecord.put(f.name(), record.get(f.name()));
  }
  if (!GenericData.get().validate(newSchema, newRecord)) {
    throw new SchemaCompatabilityException(
        "Unable to validate the rewritten record " + record + " against schema " + newSchema);
  }
  return newRecord;
}

So I am trying to understand: is there any place in our code where we actually check the compatibility of the writer's and the reader's schema? The function above simply validates the data types of the field values and checks that they are non-null. Also, can someone explain the reason behind doing the above validation? The record coming in here is created with the original target schema, and newSchema simply adds the Hudi metadata fields, so I feel this check is redundant.

On Fri, Feb 7, 2020 at 2:08 PM Pratyaksh Sharma <[email protected]> wrote:

@Vinoth Chandar <[email protected]> How does re-ordering affect things here, as you mentioned? Parquet files access fields by name rather than by index by default, so re-ordering should not matter. Please help me understand.

On Fri, Feb 7, 2020 at 11:53 AM Vinoth Chandar <[email protected]> wrote:

@Pratyaksh Sharma <[email protected]> Please go ahead :)

@Benoit, you are right about Parquet deletion, I think.

Come to think of it, with an initial schema in place, how would we even drop a field? All of the old data would need to be rewritten (prohibitively expensive). So all we would end up doing is simply masking the field from queries by mapping old data to the current schema? This can get messy pretty quickly if field re-ordering is allowed, for example. What we do/advise now is to instead embrace a more brittle schema management on the write side (no renames, no dropping fields, all fields nullable) and ensure the reader schema is simpler to manage. There is probably a middle ground here somewhere.
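(On the question above about where a writer/reader compatibility check could live: Avro ships a resolver-based check of its own. The sketch below is illustrative only; the Event schemas and class name are invented, and this is not how Hudi wires things today. It checks whether old data stays readable when a nullable, defaulted field is dropped.)

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilityCheckSketch {
  public static void main(String[] args) {
    // Existing table schema (what the old files were written with).
    Schema tableSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"note\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // Incoming schema, where the nullable 'note' field has been dropped.
    Schema incomingSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

    // Can old data (written with tableSchema) still be read with incomingSchema?
    SchemaCompatibility.SchemaPairCompatibility backward =
        SchemaCompatibility.checkReaderWriterCompatibility(incomingSchema, tableSchema);
    System.out.println(backward.getType()); // COMPATIBLE: the extra writer field is ignored

    // Can new data (written with incomingSchema) still be read with tableSchema?
    SchemaCompatibility.SchemaPairCompatibility forward =
        SchemaCompatibility.checkReaderWriterCompatibility(tableSchema, incomingSchema);
    System.out.println(forward.getType()); // COMPATIBLE, because 'note' has a default
  }
}

If the dropped field had no default, the second check would come back INCOMPATIBLE, which is essentially the rule set being discussed above.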
On Thu, Feb 6, 2020 at 12:10 PM Pratyaksh Sharma <[email protected]> wrote:

@Vinoth Chandar <[email protected]> I would like to drive this.

On Fri, Feb 7, 2020 at 1:08 AM Benoit Rousseau <[email protected]> wrote:

Hi,

I think deleting a field is supported by Avro, both backward and forward, as long as the field is optional and provides a default value.

A simple example of an Avro optional field defined using a union type and a default value:

{ "name": "foo", "type": ["null", "string"], "default": null }

Readers will use the default value when the field is not present.

I believe the problem here is Parquet, which does not support field deletion. One option is to set the Parquet field value to null. Parquet will use RLE encoding to efficiently encode all the null values in the "deleted" field.

Regards,
Benoit

On 6 Feb 2020, at 17:57, Nishith <[email protected]> wrote:

Pratyaksh,

Deleting fields isn't Avro schema backwards compatible. Hudi relies on Avro schema evolution rules, which help prevent breaking existing queries on such tables - say someone was querying the field that is now deleted. You can read more here -> https://avro.apache.org/docs/1.8.2/spec.html

That being said, I'm also looking at how we can support schema evolution slightly differently - some things could be more in our control and not break reader queries - but that's not in the near future.

Thanks

Sent from my iPhone

On Feb 5, 2020, at 11:22 PM, Pratyaksh Sharma <[email protected]> wrote:

Hi Vinoth,

We do not have any standard documentation for the said approach, as we came up with it ourselves. Just logging a conversation from the #general channel for the record -

"Hello people, I'm doing a POC to use HUDI in our data pipeline, but I got an error and I didn't find any solution for this... I wrote some parquet files with HUDI using INSERT_OPERATION_OPT_VAL, MOR_STORAGE_TYPE_OPT_VAL and sync with hive and worked perfectly. But after that, I try to write another file in the same table (with some schema changes, just delete and add some columns) and got this error: Caused by: org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'field' not found. Anyone know what to do?"

On Sun, Jan 5, 2020 at 2:00 AM Vinoth Chandar <[email protected]> wrote:

In my experience, you need to follow some rules when evolving and keep the data backwards compatible. The only other option is to rewrite the entire dataset :), which is very expensive.

If you have some pointers to learn more about the approach you are suggesting, happy to read up.

On Wed, Jan 1, 2020 at 10:26 PM Pratyaksh Sharma <[email protected]> wrote:

Hi Vinoth,

As you explained above, and as per what is mentioned in this FAQ (https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-What'sHudi'sschemaevolutionstory), Hudi is able to maintain schema evolution only if the schema is *backwards compatible*. What about the case when it is backwards incompatible? This might be the case when, for some reason, you are unable to enforce things like not deleting fields or not changing the order. Ideally we should be foolproof and able to support schema evolution in every case possible. In such a case, creating an uber schema can be useful. WDYT?
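(A rough sketch of the kind of uber-schema merge being discussed, in case it helps the RFC: keep every field of the latest schema, and carry over fields that only exist in the older schema as nullable columns defaulting to null. The method name and merge rules are made up for illustration; nested records, type changes, and unions that do not already list null first are not handled.)

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.JsonProperties;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class UberSchemaSketch {

  // Keep all fields of 'latest'; append fields that only exist in 'previous'
  // as nullable fields with a null default, so both old and new data fit.
  static Schema mergeSchemas(Schema latest, Schema previous) {
    List<Field> fields = new ArrayList<>();
    for (Field f : latest.getFields()) {
      fields.add(new Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    for (Field f : previous.getFields()) {
      if (latest.getField(f.name()) != null) {
        continue; // field present in both schemas: the latest definition wins
      }
      Schema nullable = f.schema().getType() == Schema.Type.UNION
          ? f.schema() // assume an existing union already lists null first
          : Schema.createUnion(Arrays.asList(Schema.create(Schema.Type.NULL), f.schema()));
      fields.add(new Field(f.name(), nullable, f.doc(), JsonProperties.NULL_VALUE));
    }
    Schema uber = Schema.createRecord(
        latest.getName(), latest.getDoc(), latest.getNamespace(), false);
    uber.setFields(fields);
    return uber;
  }
}

In the setup Syed describes further down the thread, 'latest' would come from the schema registry and 'previous' from the Hive metastore.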
On Wed, Jan 1, 2020 at 12:49 AM Vinoth Chandar <[email protected]> wrote:

Hi Syed,

Typically, I have seen the Confluent/Avro schema registry used as the source of truth, and the Hive schema is just a translation. That's how the hudi-hive sync also works. Have you considered making fields optional in the Avro schema, so that even if the source data does not have a few of them, there will be nulls?

In general, the two places where I have dealt with this both made it work using the schema evolution rules Avro supports, and by enforcing things like not deleting fields, not changing the order, etc.

Hope that at least helps a bit.

Thanks,
Vinoth

On Sun, Dec 29, 2019 at 11:55 PM Syed Abdul Kather <[email protected]> wrote:

Hi Team,

We pull data from Kafka generated by Debezium. The schema is maintained in the schema registry by the Confluent framework while the data is populated.

*Problem Statement:*

All additions/deletions of columns are maintained in the schema registry. While running the Hudi pipeline, we have a custom schema provider that pulls the latest schema from the schema registry as well as from the Hive metastore, and we create an uber schema (so that columns missing from the schema registry will be pulled from the Hive metastore). But is there any better approach to solve this problem?

Thanks and Regards,
S SYED ABDUL KATHER
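(To make the optional-field suggestion concrete: a small, self-contained resolution sketch with invented schemas and class name. A record encoded without the optional field reads back with the reader schema's default, which is the behaviour Benoit and Vinoth describe above.)

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DefaultResolutionSketch {
  public static void main(String[] args) throws Exception {
    // Writer schema: the source record only has 'id'.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

    // Reader schema: adds an optional 'foo' with a null default.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"foo\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // Encode one record with the writer schema.
    GenericRecord rec = new GenericData.Record(writer);
    rec.put("id", 42L);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Decode with (writer, reader): the missing field resolves to its default.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord resolved =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(resolved.get("foo")); // null, taken from the default
  }
}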
