[
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081025#comment-17081025
]
Prashant Wason commented on HUDI-741:
-------------------------------------
HUDI requires a schema to be specified in HoodieWriteConfig; the
HoodieWriteClient uses it to create the records. The schema is also saved in the
data files (Parquet format) and log files (Avro format).
Since a schema is required each time new data is ingested into a HUDI dataset,
the schema can evolve over time.
HUDI-specific validation of schema evolution should ensure that a newer schema
can be used for the dataset by checking that data written using the old schema
can be read using the new schema.
New Schema is compatible only if:
A1. There is no change in schema
A2. A field has been added and it has a default value specified
New Schema is incompatible if:
B1. A field has been deleted
B2. A field has been renamed (treated as delete + add)
B3. A field's type has changed to be incompatible with the older type
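To make rules A1-A2 and B1-B3 concrete, here is a minimal Python sketch of the check. It is a simplified model, not HUDI or Avro code: the `Field` class, the `PROMOTIONS` table, and the function names are hypothetical, a schema is reduced to a flat mapping of field name to type, and the promotion table is only a small subset modeled on Avro's numeric type promotions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical flat field model: a type name plus a has-default flag."""
    type: str
    has_default: bool = False


Schema = Dict[str, Field]  # field name -> Field

# Illustrative subset of type promotions (assumption: modeled on Avro's
# numeric promotions; a real check would cover the full type system).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}


def type_compatible(old_type: str, new_type: str) -> bool:
    """True if data written as old_type can be read as new_type."""
    return old_type == new_type or new_type in PROMOTIONS.get(old_type, set())


def evolution_errors(old: Schema, new: Schema) -> List[str]:
    """Return the list of rule violations when evolving old -> new."""
    errors = []
    for name, old_field in old.items():
        if name not in new:
            # A rename looks like a delete plus an add, so B1 and B2 coincide.
            errors.append(f"B1/B2: field '{name}' was deleted or renamed")
        elif not type_compatible(old_field.type, new[name].type):
            errors.append(
                f"B3: field '{name}' changed type "
                f"{old_field.type} -> {new[name].type}"
            )
    for name, new_field in new.items():
        if name not in old and not new_field.has_default:
            errors.append(f"A2: added field '{name}' has no default value")
    return errors


def is_compatible(old: Schema, new: Schema) -> bool:
    """A new schema is compatible only if no rule is violated."""
    return not evolution_errors(old, new)


old = {"id": Field("long"), "name": Field("string")}
with_default = {**old, "age": Field("int", has_default=True)}
dropped = {"id": Field("long")}

print(is_compatible(old, with_default))  # added field with a default (A2)
print(evolution_errors(old, dropped))    # reports the deleted field (B1)
```

Returning a list of violations rather than a bare boolean makes it easier to report which rule rejected a schema.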
*Limitation with org.apache.avro.SchemaCompatibility:*
org.apache.avro.SchemaCompatibility checks compatibility between a writer
schema (which originally wrote the Avro record) and a reader schema (with which
we are reading the record). It ONLY guarantees that each field in the reader
record can be populated from the writer record. Hence, if the reader schema is
missing a field, it is still compatible with the writer schema.
In other words, org.apache.avro.SchemaCompatibility was written to guarantee
that we can read the data written earlier. It does not enforce HUDI's schema
evolution rules: a deleted field (B1 above) still passes the check.
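The gap can be illustrated with a small Python model. This is a sketch under stated assumptions, not the Avro API: `reader_can_decode` and `no_field_dropped` are hypothetical helpers, a schema is reduced to a mapping of field name to a has-default flag, and field types are ignored entirely.

```python
from typing import Dict

# Hypothetical model: field name -> whether the field declares a default.
Schema = Dict[str, bool]


def reader_can_decode(writer: Schema, reader: Schema) -> bool:
    """Reader-side check: every reader field must be populatable, either
    from the writer record or from the field's own default."""
    return all(
        name in writer or has_default
        for name, has_default in reader.items()
    )


def no_field_dropped(writer: Schema, reader: Schema) -> bool:
    """Stricter check wanted for schema evolution: every field of the
    old (writer) schema must still exist in the new (reader) schema."""
    return all(name in reader for name in writer)


old_schema = {"id": False, "name": False}
new_schema = {"id": False}  # 'name' has been deleted

# The reader-side check accepts the new schema because 'id' can be filled
# from old records; nothing asks what happened to 'name'.
print(reader_can_decode(old_schema, new_schema))

# The stricter check rejects it, which corresponds to rule B1.
print(no_field_dropped(old_schema, new_schema))
```

In this model a validation for HUDI would need both conditions to hold, which is one way to read the fix proposed in this issue.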
> Fix Hoodie's schema evolution checks
> ------------------------------------
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Minor
> Labels: pull-request-available
> Original Estimate: 120h
> Time Spent: 10m
> Remaining Estimate: 119h 50m
>
> HUDI requires a schema to be specified in HoodieWriteConfig; the
> HoodieWriteClient uses it to create the records. The schema is also saved in
> the data files (Parquet format) and log files (Avro format).
> Since a schema is required each time new data is ingested into a HUDI
> dataset, schema can be evolved over time. But HUDI should ensure that the
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer
> schema can be used for the dataset by checking that the data written using
> the old schema can be read using the new schema.