TonyCui02 opened a new issue, #13585:
URL: https://github.com/apache/hudi/issues/13585

   **Describe the problem you faced**
   
   In 
[AverageRecordSizeEstimator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/estimator/AverageRecordSizeEstimator.java),
 the 
[PARQUET_SMALL_FILE_LIMIT](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L103)
 threshold is applied inconsistently between regular commits and delta commits. 
For delta commits, it filters individual files against the threshold, while for 
regular commits, it checks the aggregate commit size. Our team noticed this 
discrepancy while planning an upgrade.
   
   **Expected behavior**
   
   The threshold check should be consistent across both commit types. Since the [class comment](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/estimator/AverageRecordSizeEstimator.java#L49) states that "Candidate files are selective files that have a threshold size to avoid measurement errors", the threshold should be applied at the individual file level for both commit types.
   
   Checking against the total commit size only partially addresses the issue. A large commit made up of numerous small files still produces a skewed record size estimate, because per-file metadata overhead dominates the byte counts.
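   To make the metadata-dominance effect concrete, here is a self-contained numeric sketch. The file sizes, record counts, and the 1 MB per-file overhead figure are invented for illustration and are not taken from the Hudi code base; the point is that `updateStats` subtracts `avgMetadataSize` exactly once per call, so feeding it aggregate commit totals subtracts the overhead once per commit instead of once per file:

   ```java
   // Illustrative only: a commit of 100 small base files, each 2 MB with 1000
   // records, where an assumed 1 MB of each file is metadata/footer overhead.
   public class EstimationSkewSketch {

       // Per-file accounting: the overhead is subtracted from every file.
       static long perFileEstimate(long files, long fileSize, long recordsPerFile, long metaSize) {
           return (files * (fileSize - metaSize)) / (files * recordsPerFile);
       }

       // Aggregate accounting (what a single updateStats call on commit totals
       // effectively does): the overhead is subtracted only once per commit.
       static long aggregateEstimate(long files, long fileSize, long recordsPerFile, long metaSize) {
           return (files * fileSize - metaSize) / (files * recordsPerFile);
       }

       public static void main(String[] args) {
           long files = 100;
           long fileSize = 2L << 20;       // 2 MB per file
           long recordsPerFile = 1000;
           long metaSize = 1L << 20;       // assumed 1 MB overhead per file

           System.out.println("per-file:  "
               + perFileEstimate(files, fileSize, recordsPerFile, metaSize) + " bytes/record");
           System.out.println("aggregate: "
               + aggregateEstimate(files, fileSize, recordsPerFile, metaSize) + " bytes/record");
       }
   }
   ```

   With these numbers the aggregate path estimates roughly twice the per-record size of the per-file path, which is the kind of bias a file-level threshold is meant to avoid.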
   
   The current implementation uses the same threshold value for both file-level and commit-level filtering, even though the two operate at very different scales. Commit-level filtering helps control overall processing, but it does not address the core problem of metadata overhead in individual files: a threshold tuned for individual file sizes is inadequate for commit-level filtering, and vice versa. This scale mismatch undermines the size estimation.
   
   One potential approach to resolve these issues is implementing filtering at 
both commit and file levels with separate threshold configurations. We can 
retain `PARQUET_SMALL_FILE_LIMIT` for commit-level filtering while introducing 
a new configuration parameter specifically for file-level filtering. This 
decoupling from `PARQUET_SMALL_FILE_LIMIT` (which currently also manages small 
file expansion on storage) will provide better control over estimation bias 
while maintaining efficient commit processing.
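   A self-contained sketch of the proposed two-level shape follows. `WriteStat`, `filteredTotals`, and both threshold parameters are simplified stand-ins, not actual Hudi classes or configs; the `fileLevelThreshold` parameter plays the role of the new configuration this issue proposes, while `commitLevelThreshold` plays the role `PARQUET_SMALL_FILE_LIMIT` would retain:

   ```java
   import java.util.List;

   // Sketch only: gate first on the aggregate commit size, then on each
   // individual file, so the two thresholds can be tuned independently.
   public class TwoLevelFilterSketch {
       // Simplified stand-in for Hudi's HoodieWriteStat.
       record WriteStat(long totalWriteBytes, long numWrites) {}

       static long[] filteredTotals(List<WriteStat> stats,
                                    long commitLevelThreshold,  // existing PARQUET_SMALL_FILE_LIMIT role
                                    long fileLevelThreshold) {  // hypothetical new config
           long commitBytes = stats.stream().mapToLong(WriteStat::totalWriteBytes).sum();
           long bytes = 0, records = 0;
           if (commitBytes > commitLevelThreshold) {
               for (WriteStat s : stats) {
                   // File-level filter applied uniformly to every write stat.
                   if (s.totalWriteBytes() > fileLevelThreshold && s.numWrites() > 0) {
                       bytes += s.totalWriteBytes();
                       records += s.numWrites();
                   }
               }
           }
           return new long[] {bytes, records};
       }

       public static void main(String[] args) {
           List<WriteStat> stats = List.of(
               new WriteStat(200L << 20, 100_000),  // large file: kept
               new WriteStat(1L << 20, 500));       // tiny file: filtered out
           long[] totals = filteredTotals(stats, 100L << 20, 10L << 20);
           System.out.println(totals[0] + " bytes, " + totals[1] + " records");
       }
   }
   ```

   Note the tiny file is excluded from the estimate even though the commit as a whole passes the commit-level gate, which is exactly the behavior the single shared threshold cannot express today.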
   
   **Additional context**
   Relevant code snippet showing the inconsistency:
   ```java
   // Delta commits: the threshold is applied per individual base file
   // (inside updateStats, via fileSizeThreshold).
   if (instant.getAction().equals(DELTA_COMMIT_ACTION)) {
       commitMetadata.getWriteStats().stream()
           .filter(hoodieWriteStat -> FSUtils.isBaseFile(...))
           .forEach(hoodieWriteStat -> averageRecordSizeStats.updateStats(
               hoodieWriteStat.getTotalWriteBytes(), hoodieWriteStat.getNumWrites()));
   } else {
       // Regular commits: the same threshold is applied to the aggregate commit size.
       averageRecordSizeStats.updateStats(
           commitMetadata.fetchTotalBytesWritten(), commitMetadata.fetchTotalRecordsWritten());
   }

   private void updateStats(long fileSizeInBytes, long recordWritten) {
     if (fileSizeInBytes > fileSizeThreshold && fileSizeInBytes > avgMetadataSize
         && recordWritten > 0) {
       totalBytesWritten.add(fileSizeInBytes - avgMetadataSize);
       totalRecordsWritten.add(recordWritten);
     }
   }
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
