jonvex commented on code in PR #14039:
URL: https://github.com/apache/hudi/pull/14039#discussion_r2399955569
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/estimator/AverageRecordSizeEstimator.java:
##########
@@ -66,68 +65,44 @@ public AverageRecordSizeEstimator(HoodieWriteConfig writeConfig) {
   @Override
   public long averageBytesPerRecord(HoodieTimeline commitTimeline, CommitMetadataSerDe commitMetadataSerDe) {
     int maxCommits = hoodieWriteConfig.getRecordSizeEstimatorMaxCommits();
-    final AverageRecordSizeStats averageRecordSizeStats = new AverageRecordSizeStats(hoodieWriteConfig);
+    final long commitSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
Review Comment:
Per https://github.com/apache/hudi/pull/10763, it seems the threshold has always applied to the entire commit. Additionally, the config description is
```
public static final ConfigProperty<String> RECORD_SIZE_ESTIMATION_THRESHOLD = ConfigProperty
    .key("hoodie.record.size.estimation.threshold")
    .defaultValue("1.0")
    .markAdvanced()
    .withDocumentation("We use the previous commits' metadata to calculate the estimated record size and use it "
        + " to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, "
        + " Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten "
        + " larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)");
```
and the git blame dates this behavior to 2021.
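
To make the documented behavior concrete, here is a minimal standalone sketch (not Hudi's actual implementation; the `CommitStats` holder and method signature are assumptions for illustration) of the search described in the config docs: walk commits newest-to-oldest and take the average from the first commit whose `totalBytesWritten` exceeds `PARQUET_SMALL_FILE_LIMIT_BYTES * threshold`, falling back to a default otherwise:

```java
import java.util.List;

public class RecordSizeEstimateSketch {
  // Hypothetical per-commit stats holder; field names are assumptions.
  public static final class CommitStats {
    final long totalBytesWritten;
    final long totalRecordsWritten;

    public CommitStats(long totalBytesWritten, long totalRecordsWritten) {
      this.totalBytesWritten = totalBytesWritten;
      this.totalRecordsWritten = totalRecordsWritten;
    }
  }

  public static long averageBytesPerRecord(List<CommitStats> commitsNewestFirst,
                                           long parquetSmallFileLimit,
                                           double threshold,
                                           long fallback) {
    // Threshold is compared against the ENTIRE commit's bytes written,
    // matching the config documentation quoted above.
    final long commitSizeThreshold = (long) (threshold * parquetSmallFileLimit);
    for (CommitStats c : commitsNewestFirst) {
      if (c.totalBytesWritten > commitSizeThreshold && c.totalRecordsWritten > 0) {
        return c.totalBytesWritten / c.totalRecordsWritten;
      }
    }
    return fallback; // no commit was large enough for an accurate estimate
  }

  public static void main(String[] args) {
    List<CommitStats> commits = List.of(
        new CommitStats(50_000L, 100L),              // below threshold, skipped
        new CommitStats(200_000_000L, 1_000_000L));  // first "big enough" commit
    long avg = averageBytesPerRecord(commits, 100L * 1024 * 1024, 1.0, 1024L);
    System.out.println(avg); // prints 200
  }
}
```

With the default threshold of `1.0` and a 100 MB small-file limit, a 50 KB commit is skipped and the 200 MB commit supplies the estimate (200 bytes/record).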
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]