nbalajee commented on a change in pull request #2216:
URL: https://github.com/apache/hudi/pull/2216#discussion_r524813805



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieWriteStat.java
##########
@@ -49,6 +49,12 @@
    */
   private String prevCommit;
 
+  /**
+   * Total number of records written to the previous version of the file slice.
+   * If inflight commit is c2, then number of records present in f1_w1_c1.parquet.
+   */
+  private long oldNumWrites;

Review comment:
       @vinothchandar - This check can be toggled on or off on a per-table 
basis. When debugging an actual incident where the commit metadata files have 
been archived and the older snapshots of the data files have been removed by 
the cleaner, having the "oldNumWrites" metadata is a great help: by inspecting 
the archived commit metadata, we can identify the instance that resulted in a 
smaller parquet file (especially when the check is turned off and would not 
have thrown an exception).
   
   For example, if the file were to evolve from f1_c1.parquet to 
f1_c10.parquet, then without this "oldNumWrites" information we would have to 
hunt down all the older archived commits in search of the commit that touched 
the data file.
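
   As a rough sketch of that lookup (not code from this PR): assuming the 
archived write stats for a table have already been loaded into a list ordered 
by commit time, the commit that shrank a file can be found by comparing 
"oldNumWrites" with the new record count. `ArchivedStat`, its fields, and 
`firstShrinkingCommit` are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Optional;

class ShrinkFinderSketch {
  // Hypothetical flattened view of one write stat read from archived commit metadata.
  static class ArchivedStat {
    final String commitTime;
    final String fileId;
    final long oldNumWrites; // records in the previous version of the file slice
    final long numWrites;    // records in the version written by this commit

    ArchivedStat(String commitTime, String fileId, long oldNumWrites, long numWrites) {
      this.commitTime = commitTime;
      this.fileId = fileId;
      this.oldNumWrites = oldNumWrites;
      this.numWrites = numWrites;
    }
  }

  // Returns the first stat (in list order) for fileId whose new file holds
  // fewer records than the previous version did.
  static Optional<ArchivedStat> firstShrinkingCommit(List<ArchivedStat> stats, String fileId) {
    return stats.stream()
        .filter(s -> s.fileId.equals(fileId))
        .filter(s -> s.numWrites < s.oldNumWrites)
        .findFirst();
  }
}
```

   Because each stat carries both counts, a single pass over the archived 
metadata is enough; there is no need to reconstruct the previous count from an 
earlier commit.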
   
   @prashantwason - in performDatalossCheck(), we should move this outside of 
the isDatalossCheckEnabled() condition, so that we record this info 
irrespective of whether the flag is enabled.
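
   A minimal sketch of the shape I have in mind, not the actual code in this 
PR: the `WriteStat` class, the method signature, and the "new count must not be 
lower than the old count" rule are simplifying assumptions (the rule ignores 
deletes, for instance). The point is only that the recording is unconditional 
and the enforcement alone is gated by the flag.

```java
class DatalossCheckSketch {
  // Hypothetical, trimmed-down stand-in for HoodieWriteStat.
  static class WriteStat {
    long numWrites;
    long oldNumWrites;
  }

  // Assumed signature; the real performDatalossCheck() in the PR may differ.
  static void performDatalossCheck(WriteStat stat, long oldNumWrites, long numWrites,
                                   boolean datalossCheckEnabled) {
    // Always record both counts so they end up in the commit metadata.
    stat.oldNumWrites = oldNumWrites;
    stat.numWrites = numWrites;

    // Only the enforcement depends on the per-table flag.
    if (datalossCheckEnabled && numWrites < oldNumWrites) {
      throw new IllegalStateException(
          "Possible data loss: previous file had " + oldNumWrites
              + " records, new file has " + numWrites);
    }
  }
}
```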





