Re: [PR] perf: Optimize clean operation by skipping unnecessary checks in MOR tables [hudi]

via GitHub Thu, 12 Feb 2026 14:20:00 -0800


suryaprasanna commented on PR #17943:
URL: https://github.com/apache/hudi/pull/17943#issuecomment-3893702093


   > @suryaprasanna : even updates can create new file slice, if we see an opportunity for small file handling.
   >
   > https://github.com/apache/hudi/blob/833ef62055e5d9b98b1c48196a02db7c7b9ca2e3/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L211
   
   @nsivabalan the code reference you provided is for the `assignInserts` method. Update records are added directly to the bucket of their existing file group. Otherwise, record_index could not maintain two different file_ids for the same record, and compaction would also be a problem if the same records were spread across multiple file groups.
   
   Here is the `assignUpdates` logic; it does not deal with small files.
   ```
     private void assignUpdates(WorkloadProfile profile) {
       // each update location gets a partition
       Set<Entry<String, WorkloadStat>> partitionStatEntries = profile.getInputPartitionPathStatMap().entrySet();
       for (Map.Entry<String, WorkloadStat> partitionStat : partitionStatEntries) {
         WorkloadStat outputWorkloadStats =
             profile.getOutputPartitionPathStatMap().getOrDefault(partitionStat.getKey(), new WorkloadStat());
         for (Map.Entry<String, Pair<String, Long>> updateLocEntry :
             partitionStat.getValue().getUpdateLocationToCount().entrySet()) {
           addUpdateBucket(partitionStat.getKey(), updateLocEntry.getKey());
           if (profile.hasOutputWorkLoadStats()) {
             HoodieRecordLocation hoodieRecordLocation =
                 new HoodieRecordLocation(updateLocEntry.getValue().getKey(), updateLocEntry.getKey());
             outputWorkloadStats.addUpdates(hoodieRecordLocation, updateLocEntry.getValue().getValue());
           }
         }
         if (profile.hasOutputWorkLoadStats()) {
           profile.updateOutputPartitionPathStatMap(partitionStat.getKey(), outputWorkloadStats);
         }
       }
     }

     private int addUpdateBucket(String partitionPath, String fileIdHint) {
       int bucket = totalBuckets;
       updateLocationToBucket.put(fileIdHint, bucket);
       BucketInfo bucketInfo = new BucketInfo(BucketType.UPDATE, fileIdHint, partitionPath);
       bucketInfoMap.put(totalBuckets, bucketInfo);
       totalBuckets++;
       return bucket;
     }
   ```
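
To make the point about the update path concrete, here is a minimal, self-contained sketch of the same bucketing idea. It is not Hudi code: the class `UpdateBucketSketch`, its string-based bucket description, and the input shape (partition path to list of already-existing file IDs) are assumptions made purely for illustration. It only shows that each existing file ID receives exactly one update bucket, with no small-file selection involved.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the update path of a partitioner; not Hudi's classes.
public class UpdateBucketSketch {
    // Existing file ID -> bucket number assigned to it.
    private final Map<String, Integer> updateLocationToBucket = new LinkedHashMap<>();
    // Bucket number -> "partitionPath/fileId", kept only for inspection.
    private final Map<Integer, String> bucketToLocation = new LinkedHashMap<>();
    private int totalBuckets = 0;

    // Mirrors the shape of addUpdateBucket above: one new bucket per existing file ID.
    int addUpdateBucket(String partitionPath, String fileIdHint) {
        int bucket = totalBuckets;
        updateLocationToBucket.put(fileIdHint, bucket);
        bucketToLocation.put(bucket, partitionPath + "/" + fileIdHint);
        totalBuckets++;
        return bucket;
    }

    // Assumed input: partition path -> file IDs that already hold the records being updated.
    void assignUpdates(Map<String, List<String>> updatedFileIdsByPartition) {
        for (Map.Entry<String, List<String>> entry : updatedFileIdsByPartition.entrySet()) {
            for (String fileId : entry.getValue()) {
                // No small-file lookup here: the update goes to the file ID it already lives in.
                addUpdateBucket(entry.getKey(), fileId);
            }
        }
    }

    // Returns the bucket for a file ID, or -1 if no update targets it.
    int bucketFor(String fileId) {
        return updateLocationToBucket.getOrDefault(fileId, -1);
    }

    String locationOf(int bucket) {
        return bucketToLocation.get(bucket);
    }

    int getTotalBuckets() {
        return totalBuckets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> updates = new LinkedHashMap<>();
        updates.put("2026/02/12", List.of("file-a", "file-b"));
        updates.put("2026/02/13", List.of("file-c"));

        UpdateBucketSketch sketch = new UpdateBucketSketch();
        sketch.assignUpdates(updates);

        for (int bucket = 0; bucket < sketch.getTotalBuckets(); bucket++) {
            System.out.println(bucket + " -> " + sketch.locationOf(bucket));
        }
    }
}
```

In this sketch the number of update buckets always equals the number of distinct file IDs passed in, which is the property the comment relies on: an update never lands in a file group other than the one that already holds the record.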

