vinothchandar commented on a change in pull request #2216:
URL: https://github.com/apache/hudi/pull/2216#discussion_r520939276
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieWriteStat.java
##########
@@ -49,6 +49,12 @@
*/
private String prevCommit;
+ /**
+ * Total number of records written to the previous version of the file slice.
+ * If inflight commit is c2, then number of records present in f1_w1_c1.parquet.
+ */
+ private long oldNumWrites;
Review comment:
As far as I can tell, we only use this within HoodieMergeHandle. Can we
avoid adding the extra member here and simply use a local variable? I am trying
to understand the use case for recording this in the write stat.
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##########
@@ -117,6 +117,10 @@
public static final String MAX_CONSISTENCY_CHECKS_PROP = "hoodie.consistency.check.max_checks";
public static int DEFAULT_MAX_CONSISTENCY_CHECKS = 7;
+ // Data loss check before commits
+ private static final String DATALOSS_CHECK_ENABLED = "hoodie.dataloss.check.enabled";
Review comment:
Let's name this after its real purpose, e.g.
`hoodie.merge.data.validation.enabled`, and avoid calling it loss checking,
which can be rather disconcerting to users when they read it.
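A minimal sketch of how the renamed flag might be declared and read. The class, constant names, and the `false` default are assumptions for illustration only, not the actual HoodieWriteConfig code; only the key string comes from the suggestion above.

```java
import java.util.Properties;

// Hypothetical stand-in for the relevant part of HoodieWriteConfig.
class MergeValidationConfigSketch {
  // Key named after what the check does (validating merged data).
  static final String MERGE_DATA_VALIDATION_ENABLED = "hoodie.merge.data.validation.enabled";
  // Assumed default: the check stays off unless a user opts in.
  static final String DEFAULT_MERGE_DATA_VALIDATION_ENABLED = "false";

  private final Properties props;

  MergeValidationConfigSketch(Properties props) {
    this.props = props;
  }

  // Returns true only when the property is explicitly set to "true".
  boolean isMergeDataValidationEnabled() {
    return Boolean.parseBoolean(
        props.getProperty(MERGE_DATA_VALIDATION_ENABLED, DEFAULT_MERGE_DATA_VALIDATION_ENABLED));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(MERGE_DATA_VALIDATION_ENABLED, "true");
    System.out.println(new MergeValidationConfigSketch(props).isMergeDataValidationEnabled());
  }
}
```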
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
##########
@@ -261,6 +262,22 @@ public static BloomFilter readBloomFilterFromParquetMetadata(Configuration confi
return records;
}
+ /**
+ * Returns the number of records in the parquet file.
+ *
+ * @param conf Configuration
+ * @param parquetFilePath path of the file
+ */
+ public static long getRowCount(Configuration conf, Path parquetFilePath) {
+ ParquetMetadata footer;
+ long rowCount = 0;
+ footer = readMetadata(conf, parquetFilePath);
Review comment:
Sweet. I was going to suggest this. You are ahead!
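For readers following along, a rough sketch of the idea: a parquet footer lists per-row-group metadata, so the record count can be obtained by summing row counts without scanning any data pages. The `RowGroup` type below is a hypothetical stand-in so the example is self-contained; in parquet-mr the corresponding calls would be `ParquetMetadata.getBlocks()` and `BlockMetaData.getRowCount()`. The rest of the method in this PR is not shown in the diff, so this is not necessarily how it is written there.

```java
import java.util.Arrays;
import java.util.List;

class RowCountSketch {
  // Hypothetical stand-in for one row group's footer metadata.
  static final class RowGroup {
    private final long rowCount;

    RowGroup(long rowCount) {
      this.rowCount = rowCount;
    }

    long getRowCount() {
      return rowCount;
    }
  }

  // Sums the row counts of all row groups listed in a footer.
  static long totalRowCount(List<RowGroup> rowGroups) {
    long rowCount = 0;
    for (RowGroup rowGroup : rowGroups) {
      rowCount += rowGroup.getRowCount();
    }
    return rowCount;
  }

  public static void main(String[] args) {
    List<RowGroup> footer = Arrays.asList(new RowGroup(10), new RowGroup(20), new RowGroup(30));
    System.out.println(totalRowCount(footer));
  }
}
```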