Prashant Wason created HUDI-1357:
------------------------------------

             Summary: Add a check to ensure there is no data loss when writing 
to HUDI dataset
                 Key: HUDI-1357
                 URL: https://issues.apache.org/jira/browse/HUDI-1357
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason


When updating a HUDI dataset with updates + deletes, records from existing base 
files are read and merged with updates+deletes and finally written to newer 
base files.

It should hold that:

count(records_in_older_base_file) - num_deletes = count(records_in_new_base_file)

In our internal production deployment, we hit an issue where, due to a Parquet 
bug in schema handling, reading existing records returned null data. This led 
to many records from the older Parquet file not being written out to the newer 
Parquet file.

This check will ensure that such issues do not lead to silent data loss by 
triggering an exception when the record counts do not match. The check is off 
by default and is controlled through a HoodieWriteConfig parameter.
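As a rough illustration of the proposed check, the invariant could be verified 
right after the merge handle finishes writing a new base file and before the 
commit. The sketch below is hypothetical: the class name, method signature, and 
the enable flag are illustrative stand-ins, not Hudi's actual API.

```java
// Hypothetical sketch of the record-count sanity check (not Hudi's real API).
public class RecordCountValidator {

    /**
     * Verifies count(old base file) - deletes + inserts == count(new base file).
     * Throws to fail the write (and avoid committing a lossy file) when the
     * invariant is violated. The 'checkEnabled' flag stands in for the
     * proposed HoodieWriteConfig parameter, which is off by default.
     */
    public static void validate(long oldCount, long numInserts, long numDeletes,
                                long newCount, boolean checkEnabled) {
        if (!checkEnabled) {
            return;
        }
        long expected = oldCount - numDeletes + numInserts;
        if (expected != newCount) {
            throw new IllegalStateException(
                "Possible data loss: expected " + expected
                + " records in new base file but found " + newCount);
        }
    }

    public static void main(String[] args) {
        // Counts match: 100 existing - 10 deletes + 5 inserts = 95.
        validate(100, 5, 10, 95, true);
        System.out.println("invariant holds");

        // Simulate the Parquet-null bug: 20 records silently dropped.
        try {
            validate(100, 5, 10, 75, true);
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With the flag disabled (the default), validate() returns immediately, so 
existing write paths are unaffected.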





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
