suryaprasanna commented on PR #17943: URL: https://github.com/apache/hudi/pull/17943#issuecomment-3893702093
> @suryaprasanna : even updates can create new file slice, if we see an opportunity for small file handling.
>
> https://github.com/apache/hudi/blob/833ef62055e5d9b98b1c48196a02db7c7b9ca2e3/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L211

@nsivabalan the code reference you provided is for the `assignInserts` method. Update records are added directly to the existing bucket, i.e. the file group they already belong to. Otherwise, record_index could not maintain two different file_ids for the same record, and compaction would also be a problem if the same records were spread across multiple file groups.

Here is the `assignUpdates` logic; it does not deal with small files.

```
private void assignUpdates(WorkloadProfile profile) {
  // each update location gets a partition
  Set<Entry<String, WorkloadStat>> partitionStatEntries = profile.getInputPartitionPathStatMap().entrySet();
  for (Map.Entry<String, WorkloadStat> partitionStat : partitionStatEntries) {
    WorkloadStat outputWorkloadStats = profile.getOutputPartitionPathStatMap().getOrDefault(partitionStat.getKey(), new WorkloadStat());
    for (Map.Entry<String, Pair<String, Long>> updateLocEntry :
        partitionStat.getValue().getUpdateLocationToCount().entrySet()) {
      addUpdateBucket(partitionStat.getKey(), updateLocEntry.getKey());
      if (profile.hasOutputWorkLoadStats()) {
        HoodieRecordLocation hoodieRecordLocation = new HoodieRecordLocation(updateLocEntry.getValue().getKey(), updateLocEntry.getKey());
        outputWorkloadStats.addUpdates(hoodieRecordLocation, updateLocEntry.getValue().getValue());
      }
    }
    if (profile.hasOutputWorkLoadStats()) {
      profile.updateOutputPartitionPathStatMap(partitionStat.getKey(), outputWorkloadStats);
    }
  }
}

private int addUpdateBucket(String partitionPath, String fileIdHint) {
  int bucket = totalBuckets;
  updateLocationToBucket.put(fileIdHint, bucket);
  BucketInfo bucketInfo = new BucketInfo(BucketType.UPDATE, fileIdHint, partitionPath);
  bucketInfoMap.put(totalBuckets, bucketInfo);
  totalBuckets++;
  return bucket;
}
```
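
To make the point about the update path concrete, here is a minimal standalone sketch of the same bucket-assignment pattern. It is not Hudi code: `UpdateBucketSketch`, its fields, and the nested-map input are hypothetical simplifications of `UpsertPartitioner` and `WorkloadProfile`, and the file IDs and partition path are made up. It only illustrates that each existing file ID gets exactly one update bucket and that record counts and file sizes play no part in the assignment.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the update path; not the actual Hudi classes.
class UpdateBucketSketch {
  // file ID -> bucket number (same role as updateLocationToBucket above)
  private final Map<String, Integer> updateLocationToBucket = new HashMap<>();
  // bucket number -> partition path (stand-in for bucketInfoMap)
  private final Map<Integer, String> bucketToPartition = new HashMap<>();
  private int totalBuckets = 0;

  // One new bucket per existing file ID, bound to that file ID.
  int addUpdateBucket(String partitionPath, String fileId) {
    int bucket = totalBuckets;
    updateLocationToBucket.put(fileId, bucket);
    bucketToPartition.put(bucket, partitionPath);
    totalBuckets++;
    return bucket;
  }

  // Input shape (assumed for this sketch): partition path -> (file ID -> update count).
  // The counts are deliberately ignored: no size-based routing happens here.
  void assignUpdates(Map<String, Map<String, Long>> updatesByPartition) {
    for (Map.Entry<String, Map<String, Long>> partition : updatesByPartition.entrySet()) {
      for (String fileId : partition.getValue().keySet()) {
        addUpdateBucket(partition.getKey(), fileId);
      }
    }
  }

  Integer bucketFor(String fileId) {
    return updateLocationToBucket.get(fileId);
  }

  String partitionOf(int bucket) {
    return bucketToPartition.get(bucket);
  }

  int totalBuckets() {
    return totalBuckets;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("file-a", 10L);
    counts.put("file-b", 3L);
    Map<String, Map<String, Long>> updates = new LinkedHashMap<>();
    updates.put("2024/01/01", counts);

    UpdateBucketSketch sketch = new UpdateBucketSketch();
    sketch.assignUpdates(updates);
    System.out.println("file-a -> bucket " + sketch.bucketFor("file-a"));
    System.out.println("file-b -> bucket " + sketch.bucketFor("file-b"));
    System.out.println("total buckets: " + sketch.totalBuckets());
  }
}
```

In this sketch, as in the snippet quoted above, an update never opens a bucket for a file ID other than the one the record already lives in, so a record key keeps a single location.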
