Hi, I was talking at the Avro level: https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas

Nonetheless, this deserves more holistic thinking, so looking forward to the RFC.

Thanks,
Vinoth
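(For reference, the "Parsing Canonical Form" section linked above is what makes cosmetically different schemas compare equal: attributes such as doc are stripped before fingerprinting. A minimal sketch, with an invented class name and schema literals, assuming Avro 1.8+:)

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class CanonicalFormSketch {
  public static void main(String[] args) {
    // The same logical record declared twice; the second adds doc attributes,
    // which the parsing canonical form strips out.
    Schema a = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");
    Schema b = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"doc\":\"a user\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\",\"doc\":\"pk\"}]}");

    // Both reduce to the same canonical form, hence the same fingerprint.
    System.out.println(SchemaNormalization.toParsingForm(a));
    System.out.println(SchemaNormalization.toParsingForm(b));
    System.out.println(SchemaNormalization.parsingFingerprint64(a)
        == SchemaNormalization.parsingFingerprint64(b)); // true
  }
}

Schemas that differ only in such attributes produce the same canonical form and the same CRC-64-AVRO fingerprint.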
On Fri, Feb 7, 2020 at 1:24 AM Pratyaksh Sharma <[email protected]> wrote:

@Nishith

> Hudi relies on Avro schema evolution rules which helps to prevent breaking of existing queries on such tables

I want to understand this statement from the code's perspective. As far as I know, in the HoodieAvroUtils class we validate the rewritten record against the Avro schema as shown below:

private static GenericRecord rewrite(GenericRecord record, Schema schemaWithFields, Schema newSchema) {
  GenericRecord newRecord = new GenericData.Record(newSchema);
  for (Schema.Field f : schemaWithFields.getFields()) {
    newRecord.put(f.name(), record.get(f.name()));
  }
  if (!GenericData.get().validate(newSchema, newRecord)) {
    throw new SchemaCompatabilityException(
        "Unable to validate the rewritten record " + record + " against schema " + newSchema);
  }
  return newRecord;
}

So I am trying to understand: is there any place in our code where we actually check the compatibility of the writer's and the reader's schema? The function above simply validates the data types of the field values and checks that they are non-null. Also, can someone explain the reason behind doing the above validation? The record coming in here is created with the original target schema, and newSchema simply adds the Hudi metadata fields, so I feel this check is redundant.

On Fri, Feb 7, 2020 at 2:08 PM Pratyaksh Sharma <[email protected]> wrote:

@Vinoth Chandar <[email protected]> How does re-ordering affect things here, as you mentioned? Parquet files access fields by name rather than by index by default, so re-ordering should not matter. Please help me understand.

On Fri, Feb 7, 2020 at 11:53 AM Vinoth Chandar <[email protected]> wrote:

@Pratyaksh Sharma <[email protected]> Please go ahead :)

@Benoit, you are right about Parquet deletion, I think.

Come to think of it, with an initial schema in place, how would we even drop a field? All of the old data would need to be rewritten (prohibitively expensive). So all we would end up doing is simply masking the field from queries by mapping old data to the current schema? This can get messy pretty quickly if field re-ordering is allowed, for example. What we do/advise now is to instead embrace a more brittle schema management on the write side (no renames, no dropping fields, all fields nullable) and ensure the reader schema is simpler to manage. There is probably a middle ground here somewhere.
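(On the question above about where a writer/reader compatibility check could live: Avro ships a resolver-based check of its own. The sketch below is illustrative only; the Event schemas and class name are invented, and this is not how Hudi wires things today. It checks whether old data stays readable when a nullable, defaulted field is dropped.)

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilityCheckSketch {
  public static void main(String[] args) {
    // Existing table schema (what the old files were written with).
    Schema tableSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"note\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // Incoming schema, where the nullable 'note' field has been dropped.
    Schema incomingSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

    // Can old data (written with tableSchema) still be read with incomingSchema?
    SchemaCompatibility.SchemaPairCompatibility backward =
        SchemaCompatibility.checkReaderWriterCompatibility(incomingSchema, tableSchema);
    System.out.println(backward.getType()); // COMPATIBLE: the extra writer field is ignored

    // Can new data (written with incomingSchema) still be read with tableSchema?
    SchemaCompatibility.SchemaPairCompatibility forward =
        SchemaCompatibility.checkReaderWriterCompatibility(tableSchema, incomingSchema);
    System.out.println(forward.getType()); // COMPATIBLE, because 'note' has a default
  }
}

If the dropped field had no default, the second check would come back INCOMPATIBLE, which is essentially the rule set being discussed above.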
On Thu, Feb 6, 2020 at 12:10 PM Pratyaksh Sharma <[email protected]> wrote:

@Vinoth Chandar <[email protected]> I would like to drive this.

On Fri, Feb 7, 2020 at 1:08 AM Benoit Rousseau <[email protected]> wrote:

Hi,

I think deleting a field is supported by Avro, both backward and forward, as long as the field is optional and provides a default value.

A simple example of an Avro optional field defined using a union type and a default value:

{ "name": "foo", "type": ["null", "string"], "default": null }

Readers will use the default value when the field is not present.

I believe the problem here is Parquet, which does not support field deletion. One option is to set the Parquet field value to null. Parquet will use RLE encoding to efficiently encode all the null values in the "deleted" field.

Regards,
Benoit

On 6 Feb 2020, at 17:57, Nishith <[email protected]> wrote:

Pratyaksh,

Deleting fields isn't Avro schema backwards compatible. Hudi relies on Avro schema evolution rules, which help prevent breaking existing queries on such tables - say someone was querying the field that is now deleted. You can read more here -> https://avro.apache.org/docs/1.8.2/spec.html

That being said, I'm also looking at how we can support schema evolution slightly differently - some things could be more in our control and not break reader queries - but that's not in the near future.

Thanks

Sent from my iPhone

On Feb 5, 2020, at 11:22 PM, Pratyaksh Sharma <[email protected]> wrote:

Hi Vinoth,

We do not have any standard documentation for the said approach, as we came up with it ourselves. Just logging a conversation from the #general channel for the record -

"Hello people, I'm doing a POC to use HUDI in our data pipeline, but I got an error and I didn't find any solution for this... I wrote some parquet files with HUDI using INSERT_OPERATION_OPT_VAL, MOR_STORAGE_TYPE_OPT_VAL and sync with hive and worked perfectly. But after that, I try to write another file in the same table (with some schema changes, just delete and add some columns) and got this error: Caused by: org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'field' not found. Anyone know what to do?"

On Sun, Jan 5, 2020 at 2:00 AM Vinoth Chandar <[email protected]> wrote:

In my experience, you need to follow some rules when evolving and keep the data backwards compatible. The only other option is to rewrite the entire dataset :), which is very expensive.

If you have some pointers to learn more about the approach you are suggesting, happy to read up.

On Wed, Jan 1, 2020 at 10:26 PM Pratyaksh Sharma <[email protected]> wrote:

Hi Vinoth,

As you explained above, and as per what is mentioned in this FAQ (https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-What'sHudi'sschemaevolutionstory), Hudi is able to maintain schema evolution only if the schema is *backwards compatible*. What about the case when it is backwards incompatible? This might be the case when, for some reason, you are unable to enforce things like not deleting fields or not changing the order. Ideally we should be foolproof and able to support schema evolution in every case possible. In such a case, creating an uber schema can be useful. WDYT?
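(A rough sketch of the kind of uber-schema merge being discussed, in case it helps the RFC: keep every field of the latest schema, and carry over fields that only exist in the older schema as nullable columns defaulting to null. The method name and merge rules are made up for illustration; nested records, type changes, and unions that do not already list null first are not handled.)

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.JsonProperties;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class UberSchemaSketch {

  // Keep all fields of 'latest'; append fields that only exist in 'previous'
  // as nullable fields with a null default, so both old and new data fit.
  static Schema mergeSchemas(Schema latest, Schema previous) {
    List<Field> fields = new ArrayList<>();
    for (Field f : latest.getFields()) {
      fields.add(new Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    for (Field f : previous.getFields()) {
      if (latest.getField(f.name()) != null) {
        continue; // field present in both schemas: the latest definition wins
      }
      Schema nullable = f.schema().getType() == Schema.Type.UNION
          ? f.schema() // assume an existing union already lists null first
          : Schema.createUnion(Arrays.asList(Schema.create(Schema.Type.NULL), f.schema()));
      fields.add(new Field(f.name(), nullable, f.doc(), JsonProperties.NULL_VALUE));
    }
    Schema uber = Schema.createRecord(
        latest.getName(), latest.getDoc(), latest.getNamespace(), false);
    uber.setFields(fields);
    return uber;
  }
}

In the setup Syed describes further down the thread, 'latest' would come from the schema registry and 'previous' from the Hive metastore.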
On Wed, Jan 1, 2020 at 12:49 AM Vinoth Chandar <[email protected]> wrote:

Hi Syed,

Typically, I have seen the Confluent/Avro schema registry used as the source of truth, and the Hive schema is just a translation. That's how the hudi-hive sync also works. Have you considered making fields optional in the Avro schema, so that even if the source data does not have a few of them, there will be nulls?

In general, the two places where I have dealt with this both made it work using the schema evolution rules Avro supports, and by enforcing things like not deleting fields, not changing the order, etc.

Hope that at least helps a bit.

Thanks,
Vinoth

On Sun, Dec 29, 2019 at 11:55 PM Syed Abdul Kather <[email protected]> wrote:

Hi Team,

We pull data from Kafka generated by Debezium. The schema is maintained in the schema registry by the Confluent framework while the data is populated.

*Problem Statement:*

All additions/deletions of columns are maintained in the schema registry. While running the Hudi pipeline, we have a custom schema provider that pulls the latest schema from the schema registry as well as from the Hive metastore, and we create an uber schema (so that columns missing from the schema registry will be pulled from the Hive metastore). But is there any better approach to solve this problem?

Thanks and Regards,
S SYED ABDUL KATHER
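(To make the optional-field suggestion concrete: a small, self-contained resolution sketch with invented schemas and class name. A record encoded without the optional field reads back with the reader schema's default, which is the behaviour Benoit and Vinoth describe above.)

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DefaultResolutionSketch {
  public static void main(String[] args) throws Exception {
    // Writer schema: the source record only has 'id'.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

    // Reader schema: adds an optional 'foo' with a null default.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"foo\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // Encode one record with the writer schema.
    GenericRecord rec = new GenericData.Record(writer);
    rec.put("id", 42L);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Decode with (writer, reader): the missing field resolves to its default.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord resolved =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(resolved.get("foo")); // null, taken from the default
  }
}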
