[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068374#comment-17068374
 ] 

Prashant Wason commented on HUDI-741:
-------------------------------------

Background
------------------

When a record is read from a parquet file, it is decoded using the schema 
stored in that file's footer. This always succeeds because the record was 
written with that same schema. The same holds for records read from the LOG 
files, which also store their AVRO schema.

Once read, the record is converted into a GenericRecord using the 
writerSchema (the schema provided to the HoodieWriteConfig). This step raises 
an exception if the (evolved) writerSchema is incompatible with the schema 
in the parquet file.

 

Checking schema compatibility
-----------------------------------

org.apache.avro.SchemaCompatibility is a class which compares two AVRO schemas 
to ensure they are compatible. It has the concept of "reader" and "writer" 
schemas. 
 - writer schema: the schema with which the avro record was serialized into 
bytes
 - reader schema: the schema using which we are reading the serialized bytes

A reader schema is deemed compatible with the writer schema if all fields of 
the reader schema can be populated from the writer schema. Hence, this will 
consider the following two schemas as compatible:

 

Writer Schema

{
  "type": "record",
  "name": "triprec",
  "fields": [
    { "name": "_hoodie_commit_time", "type": "string" },
    { "name": "_hoodie_commit_seqno", "type": "string" }
  ]
}

 

Reader Schema

{
  "type": "record",
  "name": "triprec",
  "fields": [
    { "name": "_hoodie_commit_time", "type": "string" }
  ]
}

 

When reading the bytes using the above reader schema, the resulting record 
will be missing the _hoodie_commit_seqno field, but no exception will be 
raised.
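This permissive behaviour can be illustrated with a small sketch (plain Python standing in for Avro's schema-resolution rules; the dict "schemas" and the resolve() helper are illustrative, not Avro's actual API):

```python
# Sketch of Avro-style schema resolution: the reader keeps only the
# fields its own schema declares and silently drops the rest.
# These dict schemas and this helper are hypothetical, not Avro's API.

writer_schema = {
    "type": "record",
    "name": "triprec",
    "fields": [
        {"name": "_hoodie_commit_time", "type": "string"},
        {"name": "_hoodie_commit_seqno", "type": "string"},
    ],
}

reader_schema = {
    "type": "record",
    "name": "triprec",
    "fields": [
        {"name": "_hoodie_commit_time", "type": "string"},
    ],
}

def resolve(record, reader):
    """Project a decoded record onto the reader schema; any field the
    reader does not declare is dropped without raising an error."""
    names = {f["name"] for f in reader["fields"]}
    return {k: v for k, v in record.items() if k in names}

record = {"_hoodie_commit_time": "20200326", "_hoodie_commit_seqno": "1"}
print(resolve(record, reader_schema))
# _hoodie_commit_seqno is silently gone from the output
```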

If HUDI were to accept such a Reader Schema as an "evolved" schema (say, due 
to a bug), inadvertent data corruption/loss could result, as the conversion 
silently drops fields rather than failing.

So it is not possible to use org.apache.avro.SchemaCompatibility directly to 
perform the schema compatibility check. This class needs to be modified to 
implement the stricter check.
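The stricter check could look roughly like the following sketch (plain Python with hypothetical helper names, not the actual Hudi or Avro code): in addition to Avro's "every reader field is populatable" rule, reject an evolved schema that drops fields present in the old schema.

```python
def avro_compatible(reader, writer):
    """Avro's rule (simplified): every reader field can be populated
    from the writer, either directly or via a default value."""
    writer_names = {f["name"] for f in writer["fields"]}
    return all(
        f["name"] in writer_names or "default" in f
        for f in reader["fields"]
    )

def hudi_compatible(new, old):
    """Stricter Hudi-style rule (sketch): the evolved schema must be a
    valid reader for the old data AND must not drop any old field."""
    old_names = {f["name"] for f in old["fields"]}
    new_names = {f["name"] for f in new["fields"]}
    return avro_compatible(new, old) and old_names <= new_names

old = {"type": "record", "name": "triprec", "fields": [
    {"name": "_hoodie_commit_time", "type": "string"},
    {"name": "_hoodie_commit_seqno", "type": "string"},
]}
new = {"type": "record", "name": "triprec", "fields": [
    {"name": "_hoodie_commit_time", "type": "string"},
]}

print(avro_compatible(new, old))  # True  - Avro alone accepts it
print(hudi_compatible(new, old))  # False - the stricter check rejects it
```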

 

> Fix Hoodie's schema evolution checks
> ------------------------------------
>
>                 Key: HUDI-741
>                 URL: https://issues.apache.org/jira/browse/HUDI-741
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>            Reporter: Prashant Wason
>            Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig, which is used 
> by the HoodieWriteClient to create the records. The schema is also saved in 
> the data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, the schema can evolve over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI-specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
