TonyCui02 opened a new issue, #13585: URL: https://github.com/apache/hudi/issues/13585
**Describe the problem you faced**

In [AverageRecordSizeEstimator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/estimator/AverageRecordSizeEstimator.java), the [PARQUET_SMALL_FILE_LIMIT](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L103) threshold is applied inconsistently between regular commits and delta commits. For delta commits, individual files are filtered against the threshold, while for regular commits, only the aggregate commit size is checked. Our team noticed this discrepancy while planning an upgrade.

**Expected behavior**

The threshold check should be consistent between both commit types. Since the [class comment](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/estimator/AverageRecordSizeEstimator.java#L49) states that "Candidate files are selective files that have a threshold size to avoid measurement errors", the check should apply at the individual-file level for both commit types.

Checking against the total commit size only partially addresses the issue: a large commit made up of many small files still produces a skewed record size estimate, because metadata dominates each file's size. The current implementation also uses the same threshold value for both file-level and commit-level filtering, creating a scale mismatch. While commit-level filtering helps control overall processing, it does not solve the core problem of metadata overhead in individual files. A threshold suitable for individual file sizes is inadequate at the commit level, and vice versa, which undermines the effectiveness of the size estimation.

One potential approach is to filter at both the commit and file levels, with separate threshold configurations.
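To make the skew concrete, here is a small self-contained sketch (with made-up numbers, not taken from any real table) comparing the two code paths. `aggregateEstimate` mirrors the regular-commit path, where `avgMetadataSize` is subtracted only once from the commit total; `perFileEstimate` mirrors the delta-commit path, where it is subtracted from every file:

```java
// Hypothetical numbers illustrating the metadata-dominance skew described above.
class SkewDemo {
  // Regular-commit path: one check on the commit total,
  // avgMetadataSize subtracted only once for the whole commit.
  static long aggregateEstimate(int files, long fileBytes, long recordsPerFile, long avgMetadataSize) {
    long totalBytes = files * fileBytes;
    long totalRecords = files * recordsPerFile;
    return (totalBytes - avgMetadataSize) / totalRecords;
  }

  // Delta-commit path: metadata subtracted from every file individually.
  static long perFileEstimate(int files, long fileBytes, long recordsPerFile, long avgMetadataSize) {
    long totalBytes = files * (fileBytes - avgMetadataSize);
    long totalRecords = files * recordsPerFile;
    return totalBytes / totalRecords;
  }
}
```

For example, with 1000 files of 10 KiB each, 8 KiB of which is metadata, holding 100 records apiece, the per-file path estimates roughly 20 bytes per record while the aggregate path estimates roughly 102, a ~5x inflation.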
We can retain `PARQUET_SMALL_FILE_LIMIT` for commit-level filtering while introducing a new configuration parameter specifically for file-level filtering. Decoupling from `PARQUET_SMALL_FILE_LIMIT` (which currently also governs small-file expansion on storage) would give better control over estimation bias while keeping commit processing efficient.

**Additional context**

Relevant code snippet showing the inconsistency:

```java
// Delta commits - filters individual files against the threshold
if (instant.getAction().equals(DELTA_COMMIT_ACTION)) {
  commitMetadata.getWriteStats().stream()
      .filter(hoodieWriteStat -> FSUtils.isBaseFile(...))
      .forEach(hoodieWriteStat -> averageRecordSizeStats.updateStats(
          hoodieWriteStat.getTotalWriteBytes(), hoodieWriteStat.getNumWrites()));
} else {
  // Regular commits - checks only the aggregate commit size
  averageRecordSizeStats.updateStats(
      commitMetadata.fetchTotalBytesWritten(), commitMetadata.fetchTotalRecordsWritten());
}

private void updateStats(long fileSizeInBytes, long recordWritten) {
  if (fileSizeInBytes > fileSizeThreshold && fileSizeInBytes > avgMetadataSize && recordWritten > 0) {
    totalBytesWritten.add(fileSizeInBytes - avgMetadataSize);
    totalRecordsWritten.add(recordWritten);
  }
}
```
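The proposed decoupling could be sketched as follows. This is not the actual Hudi implementation: the class, the `fileLevelSizeThreshold` field (standing in for a hypothetical new config), and the method names are illustrative only; `commitLevelSizeThreshold` stands in for the retained `PARQUET_SMALL_FILE_LIMIT` role.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: separates the per-file threshold from the
// commit-level threshold, which the current code conflates.
class DecoupledRecordSizeStats {
  private final long fileLevelSizeThreshold;   // hypothetical new config for per-file filtering
  private final long commitLevelSizeThreshold; // retains the PARQUET_SMALL_FILE_LIMIT role
  private final long avgMetadataSize;
  private final LongAdder totalBytesWritten = new LongAdder();
  private final LongAdder totalRecordsWritten = new LongAdder();

  DecoupledRecordSizeStats(long fileLevelSizeThreshold, long commitLevelSizeThreshold, long avgMetadataSize) {
    this.fileLevelSizeThreshold = fileLevelSizeThreshold;
    this.commitLevelSizeThreshold = commitLevelSizeThreshold;
    this.avgMetadataSize = avgMetadataSize;
  }

  // Commit-level gate: skip commits whose total size is too small to be informative.
  boolean commitQualifies(long totalCommitBytes) {
    return totalCommitBytes > commitLevelSizeThreshold;
  }

  // Per-file check, applied to every write stat regardless of commit type.
  void updateFileStats(long fileSizeInBytes, long recordsWritten) {
    if (fileSizeInBytes > fileLevelSizeThreshold && fileSizeInBytes > avgMetadataSize && recordsWritten > 0) {
      totalBytesWritten.add(fileSizeInBytes - avgMetadataSize);
      totalRecordsWritten.add(recordsWritten);
    }
  }

  long averageRecordSize() {
    long records = totalRecordsWritten.sum();
    return records > 0 ? totalBytesWritten.sum() / records : 0L;
  }
}
```

With this split, both regular and delta commits would first pass the commit-level gate and then feed each file through `updateFileStats`, so the per-file metadata subtraction is applied uniformly and each threshold can be tuned to the scale it actually measures.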
