Re: [PR] perf: [Flink] Skip clustering for partitions with only one left small file in FlinkClusteringPlanStrategy [hudi]

via GitHub Sat, 18 Oct 2025 06:18:28 -0700


zoomake commented on code in PR #14087:
URL: https://github.com/apache/hudi/pull/14087#discussion_r2432926996



##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/FlinkSizeBasedClusteringPlanStrategy.java:
##########
@@ -72,8 +73,18 @@ protected Map<String, String> getStrategyParams() {
 
   @Override
   protected Stream<FileSlice> getFileSlicesEligibleForClustering(final String partition) {
-    return super.getFileSlicesEligibleForClustering(partition)
-        // Only files that have base file size smaller than small file size are eligible.
-        .filter(slice -> slice.getBaseFile().map(HoodieBaseFile::getFileSize).orElse(0L) < getWriteConfig().getClusteringSmallFileLimit());
+    Supplier<Stream<FileSlice>> streamSupplier = () -> super.getFileSlicesEligibleForClustering(partition)

Review Comment:
   @danny0405 Thank you for your feedback! We need the Supplier because the stream is consumed twice (once for counting and once for returning the results), and Java streams are single-use.
   
   I tested it without the Supplier: the stream returned by super gets consumed by the count and cannot be reused, so I kept the Supplier to ensure we can safely rebuild the filtered stream.
   
   Simply removing it would break the logic.
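
To illustrate the single-use behavior outside of Hudi, here is a minimal sketch. The class `StreamReuseSketch`, its methods, and the `SIZES` list are hypothetical stand-ins for the file slices of one partition and are not part of the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

public class StreamReuseSketch {

  // Hypothetical stand-in for the base file sizes of the slices in one partition.
  static final List<Long> SIZES = Arrays.asList(10L, 20L, 300L);

  // Reusing one Stream instance: the second terminal operation throws.
  static boolean reuseFails(long limit) {
    Stream<Long> small = SIZES.stream().filter(size -> size < limit);
    small.count(); // first terminal operation consumes the stream
    try {
      small.forEach(size -> { }); // second terminal operation on the same instance
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  // With a Supplier, each get() builds a fresh pipeline,
  // so one can be counted and another one returned.
  static Stream<Long> smallIfMoreThanOne(long limit) {
    Supplier<Stream<Long>> supplier = () -> SIZES.stream().filter(size -> size < limit);
    return supplier.get().count() > 1 ? supplier.get() : Stream.empty();
  }

  public static void main(String[] args) {
    System.out.println(reuseFails(100L));                 // true
    System.out.println(smallIfMoreThanOne(100L).count()); // 2
    System.out.println(smallIfMoreThanOne(15L).count());  // 0
  }
}
```

Note that every `get()` re-runs the whole pipeline, so in the PR the call to super and the filter would be evaluated twice per partition.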
   
   Alternatively, we could collect the stream into a List to avoid the Supplier, like this:
   
   `protected Stream<FileSlice> getFileSlicesEligibleForClustering(final String partition) {
     List<FileSlice> slices = super.getFileSlicesEligibleForClustering(partition)
             .filter(slice -> slice.getBaseFile().map(HoodieBaseFile::getFileSize).orElse(0L)
                     < getWriteConfig().getClusteringSmallFileLimit())
             .collect(Collectors.toList());
   
     if (!StringUtils.isNullOrEmpty(getWriteConfig().getClusteringSortColumns())) {
         return slices.stream();
     }
   
     if (slices.size() > 1) {
         return slices.stream();
     }
     return Stream.empty();
   }`
   
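   This uses more memory, because the filtered file slices are collected into a List before being returned. If the dataset is small, that cost should be negligible and we can remove the Supplier. Do you prefer keeping the Supplier for efficiency or switching to the List approach?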
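
For comparison, the decision logic of the List approach can be exercised in isolation. This is only a sketch under assumptions: `ListApproachSketch` and `eligible` are hypothetical names, plain `Long` sizes stand in for `FileSlice` base file sizes, and a `String` stands in for the configured clustering sort columns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListApproachSketch {

  // Hypothetical helper mirroring the List-based variant:
  // filter small files once, then decide from the collected list.
  static Stream<Long> eligible(List<Long> sizes, long smallFileLimit, String sortColumns) {
    List<Long> small = sizes.stream()
        .filter(size -> size < smallFileLimit)
        .collect(Collectors.toList());

    // With sort columns configured, even a single small file is kept.
    boolean hasSortColumns = sortColumns != null && !sortColumns.isEmpty();
    if (hasSortColumns || small.size() > 1) {
      return small.stream();
    }
    // A lone small file without sort columns is skipped.
    return Stream.empty();
  }

  public static void main(String[] args) {
    System.out.println(eligible(Arrays.asList(10L, 20L, 300L), 100L, null).count()); // 2
    System.out.println(eligible(Arrays.asList(10L, 300L), 100L, "").count());        // 0
    System.out.println(eligible(Arrays.asList(10L, 300L), 100L, "ts").count());      // 1
  }
}
```

Here the pipeline runs only once per partition, at the price of holding the filtered elements in the list.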



