nbalajee commented on a change in pull request #2216:
URL: https://github.com/apache/hudi/pull/2216#discussion_r524813805
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieWriteStat.java
##########
@@ -49,6 +49,12 @@
*/
private String prevCommit;
+ /**
+ * Total number of records written to the previous version of the file slice.
+ * If inflight commit is c2, then number of records present in f1_w1_c1.parquet.
+ */
+ private long oldNumWrites;
Review comment:
@vinothchandar - This check can be toggled on/off on a per-table basis.
When debugging an actual incident where the commit metadata files have been
archived and older snapshots of data files have been cleaned up by the cleaner,
having the "oldNumWrites" metadata is of great help: by inspecting the archived
commit metadata, we can identify the instance that resulted in a smaller
parquet file (especially when the check is turned off and would not have
thrown an exception).
For example, if the file were to evolve from f1_c1.parquet to
f1_c10.parquet, then without this "oldNumWrites" information, we would have to
hunt down all the older archived commits in search of the commit that had
touched the data file.
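
To illustrate the lookup this would enable, here is a minimal sketch. It assumes the archived commit metadata has already been read into a list ordered oldest-first; `StatEntry`, `ShrinkFinderSketch`, and `findShrinkingCommit` are hypothetical names for illustration and not part of Hudi. Treating "fewer records than the previous version" as the signal is also a simplifying assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not Hudi code: scan archived write stats for the
// first commit that wrote fewer records than the previous file version held.
class ShrinkFinderSketch {

    /** Hypothetical flattened view of one write stat in an archived commit. */
    static final class StatEntry {
        final String commitTime;
        final String fileId;
        final long oldNumWrites; // records in the previous version of the file slice
        final long numWrites;    // records written by this commit

        StatEntry(String commitTime, String fileId, long oldNumWrites, long numWrites) {
            this.commitTime = commitTime;
            this.fileId = fileId;
            this.oldNumWrites = oldNumWrites;
            this.numWrites = numWrites;
        }
    }

    /** Returns the earliest commit (in list order) where the given file shrank. */
    static Optional<String> findShrinkingCommit(List<StatEntry> archived, String fileId) {
        return archived.stream()
            .filter(e -> e.fileId.equals(fileId))
            .filter(e -> e.numWrites < e.oldNumWrites)
            .map(e -> e.commitTime)
            .findFirst();
    }

    public static void main(String[] args) {
        List<StatEntry> archived = Arrays.asList(
            new StatEntry("c2", "f1", 100, 120),
            new StatEntry("c3", "f1", 120, 80));
        System.out.println(findShrinkingCommit(archived, "f1"));
    }
}
```

With both counts stored on each write stat, every archived commit can be judged on its own, without the previous data file or the previous commit's metadata being available.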
@prashantwason - in performDatalossCheck(), we should move this outside of
the isDatalossCheckEnabled() condition, so that we record this info
irrespective of whether the flag is enabled or not.