This is a good topic for an RFC, like the one Nishith wrote for indexing - something to gather ideas and share a broader perspective on.
Anyone interested in driving this?

On Thu, Feb 6, 2020 at 8:57 AM Nishith <[email protected]> wrote:

> Pratyaksh,
>
> Deleting fields isn't backwards compatible under Avro schema evolution.
> Hudi relies on the Avro schema evolution rules, which help prevent
> breaking existing queries on such tables - say someone was querying the
> field that is now deleted.
> You can read more here -> https://avro.apache.org/docs/1.8.2/spec.html
> That being said, I'm also looking at how we can support schema evolution
> slightly differently - some things could be more in our control and not
> break reader queries - but that's not in the near future.
>
> Thanks
>
> Sent from my iPhone
>
> On Feb 5, 2020, at 11:22 PM, Pratyaksh Sharma <[email protected]> wrote:
>
>> Hi Vinoth,
>>
>> We do not have any standard documentation for the said approach as it
>> was self thought through. Just logging a conversation from the #general
>> channel here for the record -
>>
>> "Hello people, I'm doing a POC to use HUDI in our data pipeline, but I
>> got an error and I didn't find any solution for this... I wrote some
>> parquet files with HUDI using INSERT_OPERATION_OPT_VAL and
>> MOR_STORAGE_TYPE_OPT_VAL, synced with Hive, and it worked perfectly.
>> But after that, I tried to write another file into the same table (with
>> some schema changes - just deleting and adding some columns) and got
>> this error: Caused by: org.apache.parquet.io.InvalidRecordException:
>> Parquet/Avro schema mismatch: Avro field 'field' not found. Anyone know
>> what to do?"
>>
>> On Sun, Jan 5, 2020 at 2:00 AM Vinoth Chandar <[email protected]> wrote:
>>
>>> In my experience, you need to follow some rules on evolving and keep
>>> the data backwards compatible. The only other option is to rewrite the
>>> entire dataset :), which is very expensive.
>>>
>>> If you have some pointers to learn more about any approach you are
>>> suggesting, happy to read up.
>>>
>>> On Wed, Jan 1, 2020 at 10:26 PM Pratyaksh Sharma <[email protected]> wrote:
>>>
>>>> Hi Vinoth,
>>>>
>>>> As you explained above, and as per what is mentioned in this FAQ
>>>> (https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-What'sHudi'sschemaevolutionstory),
>>>> Hudi is able to handle schema evolution only if the schema is
>>>> *backwards compatible*. What about the case when it is backwards
>>>> incompatible? This might be the case when, for some reason, you are
>>>> unable to enforce things like not deleting fields or not changing the
>>>> order. Ideally we should be foolproof and able to support schema
>>>> evolution in every case possible. In such a case, creating an uber
>>>> schema can be useful. WDYT?
>>>>
>>>> On Wed, Jan 1, 2020 at 12:49 AM Vinoth Chandar <[email protected]> wrote:
>>>>
>>>>> Hi Syed,
>>>>>
>>>>> Typically, I have seen the Confluent/Avro schema registry used as the
>>>>> source of truth, and the Hive schema is just a translation. That's
>>>>> how the hudi-hive sync also works.
>>>>> Have you considered making fields optional in the Avro schema, so
>>>>> that even if the source data does not have some of them, there will
>>>>> be nulls?
>>>>> In general, the two places where I have dealt with this both made it
>>>>> work using the schema evolution rules Avro supports, and by enforcing
>>>>> things like not deleting fields, not changing order, etc.
>>>>>
>>>>> Hope that at least helps a bit.
>>>>>
>>>>> thanks
>>>>> vinoth
>>>>>
>>>>> On Sun, Dec 29, 2019 at 11:55 PM Syed Abdul Kather <[email protected]> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> We pull data from Kafka generated by Debezium. The schema is
>>>>>> maintained in the schema registry by the Confluent framework as the
>>>>>> data is populated.
>>>>>>
>>>>>> *Problem Statement Here:*
>>>>>>
>>>>>> All column additions/deletions are tracked in the schema registry.
>>>>>> When running the Hudi pipeline, we have a custom schema provider
>>>>>> that pulls the latest schema from the schema registry as well as
>>>>>> from the Hive metastore, and we create an uber schema (so that
>>>>>> columns missing from the schema registry will be pulled from the
>>>>>> Hive metastore). Is there a better approach to solve this problem?
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> S SYED ABDUL KATHER
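To make the compatibility discussion above concrete, here is a minimal, self-contained sketch (plain Avro only, no Hudi; the "Event" record and its fields are invented for illustration) showing why deleting a required field breaks existing readers, while a field declared optional with a null default - as Vinoth suggests - evolves cleanly:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.SchemaCompatibility;

    public class SchemaEvolutionSketch {
      public static void main(String[] args) {
        // Existing table schema where 'email' is required (no default).
        Schema oldRequired = SchemaBuilder.record("Event").namespace("com.example")
            .fields().requiredString("id").requiredString("email").endRecord();

        // Same table schema, but 'email' declared optional: union(null, string) with default null.
        Schema oldOptional = SchemaBuilder.record("Event").namespace("com.example")
            .fields().requiredString("id").optionalString("email").endRecord();

        // Newly incoming schema with 'email' deleted.
        Schema incoming = SchemaBuilder.record("Event").namespace("com.example")
            .fields().requiredString("id").endRecord();

        // Can a reader holding the old schema still read data written with the new one?
        System.out.println(SchemaCompatibility
            .checkReaderWriterCompatibility(oldRequired, incoming).getType()); // INCOMPATIBLE
        System.out.println(SchemaCompatibility
            .checkReaderWriterCompatibility(oldOptional, incoming).getType()); // COMPATIBLE
      }
    }

The first check fails because the old reader schema still expects 'email' and has no default to fall back on; the second passes, and existing queries simply see nulls where the field is missing.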
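For the record, here is also a rough sketch of the uber-schema merge described in the problem statement above: take the union of the latest registry schema and the current table schema, so that columns dropped upstream are still carried in the written files. The mergeSchemas helper is hypothetical (the real pipeline would obtain the two inputs from the Confluent registry and the Hive metastore), and it assumes Avro 1.8+ for the Object-valued field-default APIs:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Field;

    public class UberSchemaSketch {

      // Union of fields: start from the latest registry schema, then append any
      // field that exists only in the table (Hive metastore) schema.
      static Schema mergeSchemas(Schema registrySchema, Schema tableSchema) {
        Map<String, Field> byName = new LinkedHashMap<>();
        for (Field f : registrySchema.getFields()) {
          byName.put(f.name(), f);
        }
        for (Field f : tableSchema.getFields()) {
          byName.putIfAbsent(f.name(), f);
        }
        List<Field> copies = new ArrayList<>();
        for (Field f : byName.values()) {
          // Field instances cannot be reused across schemas, so copy each one.
          copies.add(new Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
        }
        Schema merged = Schema.createRecord(registrySchema.getName(), registrySchema.getDoc(),
            registrySchema.getNamespace(), registrySchema.isError());
        merged.setFields(copies);
        return merged;
      }
    }

A merge like this keeps previously written columns readable, but it ties back to Vinoth's point above: the registry is usually treated as the single source of truth, with the Hive schema as just a translation of it.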
