Prashant Wason created HUDI-1357:
------------------------------------
Summary: Add a check to ensure there is no data loss when writing
to HUDI dataset
Key: HUDI-1357
URL: https://issues.apache.org/jira/browse/HUDI-1357
Project: Apache Hudi
Issue Type: Improvement
Reporter: Prashant Wason
When updating a HUDI dataset with updates and deletes, records from the existing base
files are read, merged with the updates and deletes, and finally written to newer
base files.
It should hold that:
count(records_in_old_base_file) = count(records_in_new_base_file) + num_deletes
In our internal production deployment, we hit an issue where, due to a Parquet
bug in schema handling, reading existing records returned null data. This
led to many records not being written out from the older Parquet file into the
newer Parquet file.
This check will ensure that such issues do not lead to silent data loss by
throwing an exception when the expected record counts do not match. The check is
off by default and controlled through a HoodieWriteConfig parameter.
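A minimal sketch of such a count check, assuming the writer can observe the
record counts involved (the class and method names here are illustrative, not
Hudi's actual API):

```java
// Hypothetical validator enforcing the merge invariant described above:
//   count(old base file) = count(new base file) + num_deletes
// Names are illustrative; Hudi's real check would live in its write path
// and be gated by a HoodieWriteConfig flag.
public class RecordCountValidator {

    /**
     * Throws IllegalStateException if records appear to have been lost
     * during the merge/rewrite of a base file.
     */
    public static void validateRecordCounts(long oldCount, long numDeletes, long newCount) {
        long expectedNewCount = oldCount - numDeletes;
        if (newCount != expectedNewCount) {
            throw new IllegalStateException(String.format(
                "Possible data loss: expected %d records in new base file but found %d",
                expectedNewCount, newCount));
        }
    }
}
```

With such a hook, a null-returning read like the Parquet schema bug described
above would surface immediately as a count mismatch instead of a silent drop.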
--
This message was sent by Atlassian Jira
(v8.3.4#803005)