[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068356#comment-17068356
 ] 

Prashant Wason commented on HUDI-741:
-------------------------------------

Inserts and updates to a HUDI table should validate that the current 
writeSchema is compatible with the schema already used by the table.

A new schema can be incompatible with the older schema for several reasons:
1. A field was deleted (intentionally or due to a bug)
2. A field was added without a default value
3. A field's type was changed (e.g. string to int)
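
The three rules above can be sketched as a small check. This is a minimal 
illustration, not Hudi's or Avro's API: the schema is modelled as a plain dict, 
and find_incompatibilities and PROMOTIONS are hypothetical names. The promotion 
table follows the Avro schema-resolution rules. Plain Avro resolution would 
tolerate a field missing from the reader schema, so rule 1 is the stricter rule 
proposed in this comment.

```python
# Hypothetical model: a schema is a dict of
#   field name -> {"type": <str>, "default": <optional>}
# Type promotions a reader may apply (old type -> allowed new types),
# following the Avro schema-resolution rules.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def find_incompatibilities(old_schema, new_schema):
    """Return the reasons why new_schema is incompatible with old_schema."""
    problems = []
    for name, old_field in old_schema.items():
        if name not in new_schema:
            # Rule 1: a field was deleted.
            problems.append(f"field '{name}' was deleted")
            continue
        old_type = old_field["type"]
        new_type = new_schema[name]["type"]
        if old_type != new_type and new_type not in PROMOTIONS.get(old_type, set()):
            # Rule 3: the type changed in a way a reader cannot promote.
            problems.append(
                f"field '{name}' changed type from {old_type} to {new_type}"
            )
    for name, new_field in new_schema.items():
        if name not in old_schema and "default" not in new_field:
            # Rule 2: a field was added without a default value.
            problems.append(f"field '{name}' was added without a default value")
    return problems


old = {"id": {"type": "string"}, "ts": {"type": "long"}}
new = {"id": {"type": "int"}, "note": {"type": "string"}}
for problem in find_incompatibilities(old, new):
    print(problem)
```

An empty result means that data written with the old schema can still be read 
with the new one under these rules.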



Data ingestion using such an incompatible schema should not be allowed, as it 
will affect reading of data as well as future ingestion (e.g. after the buggy 
schema is reverted).

 

Current issues:

For COW tables:
1. Inserts to a new partition with an incompatible schema are allowed (since 
there are no existing parquet files, no merge is done)

For MOR tables:
1. Inserts to a new partition with an incompatible schema are allowed (since 
there are no existing parquet files, no merge is done)
2. Inserts to a new partition with an incompatible schema are allowed (a LOG 
file may be created with HoodieAvroDataBlock)
3. Appends to an existing LOG file with an incompatible schema are allowed (a 
new HoodieAvroDataBlock is added)
4. Updates with an incompatible schema are allowed (a new HoodieAvroDataBlock 
is added)
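
All of the cases above share one cause: the schema is only checked as a side 
effect of a merge. A rough sketch of the alternative is to validate once, 
before the write is dispatched, so the check applies to every operation. The 
names TableWriter, is_compatible and IncompatibleSchemaError are hypothetical 
and do not exist in Hudi, and the compatibility rule is deliberately simplified.

```python
class IncompatibleSchemaError(Exception):
    """Raised when a write schema cannot be used with the table schema."""


def is_compatible(table_schema, write_schema):
    # Simplified rule for this sketch: every existing field must still be
    # present with the same type, and every new field must carry a default.
    for name, field in table_schema.items():
        if name not in write_schema or write_schema[name]["type"] != field["type"]:
            return False
    return all(
        name in table_schema or "default" in field
        for name, field in write_schema.items()
    )


class TableWriter:
    def __init__(self, table_schema):
        self.table_schema = table_schema
        self.written = []  # (operation, partition) pairs accepted so far

    def write(self, operation, partition, write_schema):
        # Validate up front, whether or not the partition already has files
        # and whether or not a merge would take place.
        if not is_compatible(self.table_schema, write_schema):
            raise IncompatibleSchemaError(
                f"{operation} to {partition} rejected: incompatible schema"
            )
        self.written.append((operation, partition))
        # The accepted schema becomes the baseline for later writes.
        self.table_schema = write_schema


writer = TableWriter({"id": {"type": "string"}})
evolved = {"id": {"type": "string"}, "note": {"type": "string", "default": ""}}
writer.write("insert", "2020/03/27", evolved)  # accepted

try:
    # A new partition has no files to merge with, yet the write is rejected.
    writer.write("insert", "2020/03/28", {"id": {"type": "int"}})
except IncompatibleSchemaError as err:
    print(err)
```

With the check placed here, inserts to new partitions, log appends and updates 
are all rejected in the same way.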

 

 

 

> Fix Hoodie's schema evolution checks
> ------------------------------------
>
>                 Key: HUDI-741
>                 URL: https://issues.apache.org/jira/browse/HUDI-741
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>            Reporter: Prashant Wason
>            Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a schema to be specified in HoodieWriteConfig; it is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, the schema can evolve over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI-specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
