@Nishith

>> Hudi relies on Avro schema evolution rules, which help prevent breaking existing queries on such tables

I want to understand this statement from code's perspective. According to
what I know, in HoodieAvroUtils class, we are trying to validate the
rewritten record against Avro schema as given below -

private static GenericRecord rewrite(GenericRecord record, Schema schemaWithFields,
    Schema newSchema) {
  GenericRecord newRecord = new GenericData.Record(newSchema);
  for (Schema.Field f : schemaWithFields.getFields()) {
    newRecord.put(f.name(), record.get(f.name()));
  }
  if (!GenericData.get().validate(newSchema, newRecord)) {
    throw new SchemaCompatabilityException(
        "Unable to validate the rewritten record " + record + " against schema " + newSchema);
  }
  return newRecord;
}

So I am trying to understand: is there any place in our code where we actually
check the compatibility of the writer's and reader's schemas? The above
function simply validates the data types of the field values and checks that
they are non-null. Also, can someone explain the reason behind doing the above
validation? The record coming here is created with the original target schema,
and newSchema simply includes the Hoodie metadata fields on top of it. So I
feel this check is redundant.
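
To make the question concrete, here is a minimal standalone sketch (not code
from the Hudi repo; the schemas and class name are invented for illustration)
of the distinction I mean. GenericData.validate() only checks a record against
the single schema it is given, whereas a genuine writer-vs-reader check would
go through Avro's schema resolution rules, e.g. via SchemaCompatibility:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class SchemaCheckSketch {

  public static void main(String[] args) {
    // Reader (existing table) schema expects two required fields, neither with a default.
    Schema readerSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id")
        .requiredString("name")
        .endRecord();

    // Writer (incoming batch) schema dropped the "name" field.
    Schema writerSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id")
        .endRecord();

    // GenericData.validate() only verifies that the datum conforms to the one
    // schema passed in (field types, non-null for non-nullable fields)...
    GenericRecord rec = new GenericData.Record(writerSchema);
    rec.put("id", "1");
    System.out.println(GenericData.get().validate(writerSchema, rec)); // true

    // ...whereas an actual reader/writer compatibility check applies Avro's
    // resolution rules and flags the dropped field, because the reader has no
    // default value for "name".
    SchemaCompatibility.SchemaPairCompatibility compat =
        SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);
    System.out.println(compat.getType()); // INCOMPATIBLE
  }
}

If Hudi intends to rely on Avro's evolution rules to protect existing queries,
I would have expected a check along these lines somewhere in the write path;
please point me to it if it already exists.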

On Fri, Feb 7, 2020 at 2:08 PM Pratyaksh Sharma <[email protected]>
wrote:

> @Vinoth Chandar <[email protected]> How does re-ordering cause issues here, as
> you mentioned? Parquet files access fields by name rather than by index by
> default, so re-ordering should not matter. Please help me understand.
>
> On Fri, Feb 7, 2020 at 11:53 AM Vinoth Chandar <[email protected]> wrote:
>
>> @Pratyaksh Sharma <[email protected]> Please go ahead :)
>>
>> @Benoit, you are right about Parquet deletion, I think.
>>
>> Come to think of it, with an initial schema in place, how would we even
>> drop a field? All of the old data would need to be rewritten (prohibitively
>> expensive)? So all we would end up doing is simply masking the field from
>> queries by mapping old data to the current schema? This can get messy
>> pretty quickly if field re-ordering is allowed, for example. What we
>> do/advise now is to instead embrace more brittle schema management on the
>> write side (no renames, no dropping fields, all fields nullable) and ensure
>> the reader schema is simpler to manage.. There is probably a middle ground
>> here somewhere.
>>
>>
>>
>> On Thu, Feb 6, 2020 at 12:10 PM Pratyaksh Sharma <[email protected]>
>> wrote:
>>
>>> @Vinoth Chandar <[email protected]> I would like to drive this.
>>>
>>> On Fri, Feb 7, 2020 at 1:08 AM Benoit Rousseau <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think deleting a field is supported by Avro, both backward and forward,
>>>> as long as the field is optional and provides a default value.
>>>>
>>>> A simple example of an Avro optional field defined using a union type and
>>>> a default value:
>>>> { "name": "foo", "type": ["null", "string"], "default": null }
>>>> Readers will use the default value when the field is not present.
>>>>
>>>> I believe the problem here is Parquet, which does not support field
>>>> deletion.
>>>> One option is to set the Parquet field value to null; Parquet will use RLE
>>>> encoding to efficiently encode the all-null values in the "deleted" field.
>>>>
>>>> Regards,
>>>> Benoit
>>>>
>>>> > On 6 Feb 2020, at 17:57, Nishith <[email protected]> wrote:
>>>> >
>>>>
>>>> Pratyaksh,
>>>>
>>>> Deleting fields isn't Avro schema backwards compatible. Hudi relies on
>>>> Avro schema evolution rules, which help prevent breaking existing
>>>> queries on such tables - say someone was querying the field that is now
>>>> deleted.
>>>> You can read more here -> https://avro.apache.org/docs/1.8.2/spec.html
>>>> That being said, I'm also looking at how we can support schema
>>>> evolution slightly differently - some things could be more in our control
>>>> and not break reader queries - but that's not in the near future.
>>>>
>>>> Thanks
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On Feb 5, 2020, at 11:22 PM, Pratyaksh Sharma <[email protected]>
>>>> wrote:
>>>> >
>>>> > Hi Vinoth,
>>>> >
>>>> > We do not have any standard documentation for the said approach, as it
>>>> > was thought through on our own. Just logging a conversation from the
>>>> > #general channel for the record -
>>>> >
>>>> > "Hello people, I'm doing a POC to use HUDI in our data pipeline, but
>>>> I got
>>>> > an error and I didnt find any solution for this... I wrote some
>>>> parquet
>>>> > files with HUDI using INSERT_OPERATION_OPT_VAL,
>>>> MOR_STORAGE_TYPE_OPT_VAL
>>>> > and sync with hive and worked perfectly. But after that, I try to
>>>> wrote
>>>> > another file in the same table (with some schema changes, just delete
>>>> and
>>>> > add some columns) and got this error Caused by:
>>>> > org.apache.parquet.io.InvalidRecordException:
>>>> > Parquet/Avro schema mismatch: Avro field 'field' not found. Anyone
>>>> know
>>>> > what to do?"
>>>> >
>>>> >>> On Sun, Jan 5, 2020 at 2:00 AM Vinoth Chandar <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> In my experience, you need to follow some rules on evolving and keep
>>>> the
>>>> >> data backwards compatible. Or the only other option is to rewrite the
>>>> >> entire dataset :), which is very expensive.
>>>> >>
>>>> >> If you have some pointers to learn more about any approach you are
>>>> >> suggesting, happy to read up.
>>>> >>
>>>> >> On Wed, Jan 1, 2020 at 10:26 PM Pratyaksh Sharma <
>>>> [email protected]>
>>>> >> wrote:
>>>> >>
>>>> >>> Hi Vinoth,
>>>> >>>
>>>> >>> As you explained above, and as per what is mentioned in this FAQ (
>>>> >>> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-What'sHudi'sschemaevolutionstory
>>>> >>> ),
>>>> >>> Hudi is able to maintain schema evolution only if the schema is
>>>> >>> *backwards compatible*. What about the case when it is backwards
>>>> >>> incompatible? This might be the case when, for some reason, you are
>>>> >>> unable to enforce things like not deleting fields or not changing the
>>>> >>> order. Ideally we should be foolproof and able to support schema
>>>> >>> evolution in every case possible. In such a case, creating an uber
>>>> >>> schema can be useful. WDYT?
>>>> >>>
>>>> >>> On Wed, Jan 1, 2020 at 12:49 AM Vinoth Chandar <[email protected]>
>>>> >> wrote:
>>>> >>>
>>>> >>>> Hi Syed,
>>>> >>>>
>>>> >>>> Typically, I have seen the Confluent/Avro schema registry used as the
>>>> >>>> source of truth, and the Hive schema is just a translation. That's how
>>>> >>>> the hudi-hive sync also works..
>>>> >>>> Have you considered making fields optional in the Avro schema, so that
>>>> >>>> even if the source data does not have some of them, there will be nulls..
>>>> >>>> In general, the two places where I have dealt with this both made it
>>>> >>>> work using the schema evolution rules Avro supports.. and enforcing
>>>> >>>> things like not deleting fields, not changing order etc.
>>>> >>>>
>>>> >>>> Hope that at least helps a bit
>>>> >>>>
>>>> >>>> thanks
>>>> >>>> vinoth
>>>> >>>>
>>>> >>>> On Sun, Dec 29, 2019 at 11:55 PM Syed Abdul Kather <
>>>> [email protected]
>>>> >>>
>>>> >>>> wrote:
>>>> >>>>
>>>> >>>>> Hi Team,
>>>> >>>>>
>>>> >>>>> We pull data from Kafka generated by Debezium. The schema is
>>>> >>>>> maintained in the schema registry by the Confluent framework during
>>>> >>>>> the population of data.
>>>> >>>>>
>>>> >>>>> *Problem Statement Here:*
>>>> >>>>>
>>>> >>>>> All additions/deletions of columns are maintained in the schema
>>>> >>>>> registry. While running the Hudi pipeline, we have a custom schema
>>>> >>>>> provider that pulls the latest schema from the schema registry as
>>>> >>>>> well as from the Hive metastore, and we create an uber schema (so
>>>> >>>>> that columns missing from the schema registry will be pulled from
>>>> >>>>> the Hive metastore). But is there any better approach to solve this
>>>> >>>>> problem?
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>           Thanks and Regards,
>>>> >>>>>       S SYED ABDUL KATHER
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>
>>>>
>>>
