jonvex commented on code in PR #14039:
URL: https://github.com/apache/hudi/pull/14039#discussion_r2399955569
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/estimator/AverageRecordSizeEstimator.java:
##########
@@ -66,68 +65,44 @@ public AverageRecordSizeEstimator(HoodieWriteConfig writeConfig) {
   @Override
   public long averageBytesPerRecord(HoodieTimeline commitTimeline, CommitMetadataSerDe commitMetadataSerDe) {
     int maxCommits = hoodieWriteConfig.getRecordSizeEstimatorMaxCommits();
-    final AverageRecordSizeStats averageRecordSizeStats = new AverageRecordSizeStats(hoodieWriteConfig);
+    final long commitSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
Review Comment:
Per https://github.com/apache/hudi/pull/10763, it seems the threshold has always applied to the entire commit. Additionally, the config description is
```
public static final ConfigProperty<String> RECORD_SIZE_ESTIMATION_THRESHOLD = ConfigProperty
    .key("hoodie.record.size.estimation.threshold")
    .defaultValue("1.0")
    .markAdvanced()
    .withDocumentation("We use the previous commits' metadata to calculate the estimated record size and use it "
        + " to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, "
        + " Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten "
        + " larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)");
```
and the git blame dates this behavior to 2021.
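
To make the documented behavior concrete, here is a minimal standalone sketch (not Hudi's actual implementation; the `CommitStats` holder and method signature are assumptions for illustration) of the search described in the config docs: walk commits newest-to-oldest and take the average from the first commit whose `totalBytesWritten` exceeds `PARQUET_SMALL_FILE_LIMIT_BYTES * threshold`, falling back to a default otherwise:

```java
import java.util.List;

public class RecordSizeEstimateSketch {
  // Hypothetical per-commit stats holder; field names are assumptions.
  public static final class CommitStats {
    final long totalBytesWritten;
    final long totalRecordsWritten;

    public CommitStats(long totalBytesWritten, long totalRecordsWritten) {
      this.totalBytesWritten = totalBytesWritten;
      this.totalRecordsWritten = totalRecordsWritten;
    }
  }

  public static long averageBytesPerRecord(List<CommitStats> commitsNewestFirst,
                                           long parquetSmallFileLimit,
                                           double threshold,
                                           long fallback) {
    // Threshold is compared against the ENTIRE commit's bytes written,
    // matching the config documentation quoted above.
    final long commitSizeThreshold = (long) (threshold * parquetSmallFileLimit);
    for (CommitStats c : commitsNewestFirst) {
      if (c.totalBytesWritten > commitSizeThreshold && c.totalRecordsWritten > 0) {
        return c.totalBytesWritten / c.totalRecordsWritten;
      }
    }
    return fallback; // no commit was large enough for an accurate estimate
  }

  public static void main(String[] args) {
    List<CommitStats> commits = List.of(
        new CommitStats(50_000L, 100L),              // below threshold, skipped
        new CommitStats(200_000_000L, 1_000_000L));  // first "big enough" commit
    long avg = averageBytesPerRecord(commits, 100L * 1024 * 1024, 1.0, 1024L);
    System.out.println(avg); // prints 200
  }
}
```

With the default threshold of `1.0` and a 100 MB small-file limit, a 50 KB commit is skipped and the 200 MB commit supplies the estimate (200 bytes/record).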
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]