[
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068356#comment-17068356
]
Prashant Wason commented on HUDI-741:
-------------------------------------
Inserts and updates to a HUDI table should validate that the writeSchema
being used is compatible with the schema the table was written with.
A new schema can be incompatible with the older schema for various reasons:
1. A schema field is deleted (intentionally or due to a bug)
2. A schema field is added without a default value
3. A schema field's type is changed (e.g. string to int)
Data ingestion using such an incompatible schema should not be allowed, as it
will affect reading of data as well as future ingestion (e.g. after the buggy
schema is reverted).
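The three reasons above can be sketched as a small check. This is a simplified
illustration, not Hudi's or Avro's actual validation: schemas are modelled as
plain dictionaries, find_incompatibilities is a hypothetical name, and every
type change is treated as incompatible (stricter than Avro's resolution rules,
which permit some promotions such as int to long).

```python
def find_incompatibilities(old_fields, new_fields):
    """Compare two schemas modelled as {field name: {"type": str, "default": ...}}.

    The "default" key is optional. Returns a list of human-readable problems;
    an empty list means the new schema passed all three checks.
    """
    problems = []
    for name, old in old_fields.items():
        if name not in new_fields:
            # Reason 1: a field present in the old schema is gone.
            problems.append(f"field deleted: {name}")
        elif new_fields[name]["type"] != old["type"]:
            # Reason 3: the field survives but its type differs.
            problems.append(
                f"type changed: {name} ({old['type']} -> {new_fields[name]['type']})"
            )
    for name, new in new_fields.items():
        if name not in old_fields and "default" not in new:
            # Reason 2: old records have no value for this field and
            # the schema offers no default to fill in.
            problems.append(f"field added without default: {name}")
    return problems
```

A write path could reject the write whenever the returned list is non-empty.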
Current issues:
For COW tables:
1. Inserts to a new partition with an incompatible schema are allowed (since
there are no existing parquet files, no merge is done)
For MOR tables:
1. Inserts to a new partition with an incompatible schema are allowed (since
there are no existing parquet files, no merge is done)
2. Inserts to a new partition with an incompatible schema are allowed (a LOG
file may be created with a HoodieAvroDataBlock)
3. Appends to an existing LOG file with an incompatible schema are allowed (a
new HoodieAvroDataBlock is added)
4. Updates with an incompatible schema are allowed (a new HoodieAvroDataBlock
is added)
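In each case listed above, data is written without the incoming schema being
compared to the table's schema. A minimal sketch of validating once per write,
before records reach any partition or log file, could look like this. The
table dictionary, write_batch, IncompatibleSchemaError and the is_compatible
callback are all hypothetical names for illustration, not Hudi APIs.

```python
class IncompatibleSchemaError(Exception):
    """Raised when a write uses a schema the table cannot accept."""


def write_batch(table, partition, records, write_schema, is_compatible):
    # Hypothetical table model:
    #   {"schema": dict or None, "partitions": {partition name: [records]}}
    table_schema = table["schema"]
    # Validate against the table-level schema first, so a brand-new partition
    # (where there is no existing file to merge with) gets the same check.
    if table_schema is not None and not is_compatible(table_schema, write_schema):
        raise IncompatibleSchemaError(f"rejected write to partition {partition!r}")
    # Only after the check passes is any data written or the schema advanced.
    table["partitions"].setdefault(partition, []).extend(records)
    table["schema"] = write_schema
```

Because the check runs before any partition is touched, a rejected write
leaves both the data and the stored schema unchanged.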
> Fix Hoodie's schema evolution checks
> ------------------------------------
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Prashant Wason
> Priority: Minor
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> HUDI requires a schema to be specified in HoodieWriteConfig; it is used by
> the HoodieWriteClient to create the records. The schema is also saved in the
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI
> dataset, the schema can evolve over time. But HUDI should ensure that the
> evolved schema is compatible with the older schema.
> HUDI-specific validation of schema evolution should ensure that a newer
> schema can be used for the dataset by checking that data written using
> the old schema can be read using the new schema.