zoomake commented on code in PR #14087:
URL: https://github.com/apache/hudi/pull/14087#discussion_r2432926996


##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/FlinkSizeBasedClusteringPlanStrategy.java:
##########
@@ -72,8 +73,18 @@ protected Map<String, String> getStrategyParams() {
 
   @Override
   protected Stream<FileSlice> getFileSlicesEligibleForClustering(final String partition) {
-    return super.getFileSlicesEligibleForClustering(partition)
-        // Only files that have base file size smaller than small file size are eligible.
-        .filter(slice -> slice.getBaseFile().map(HoodieBaseFile::getFileSize).orElse(0L) < getWriteConfig().getClusteringSmallFileLimit());
+    Supplier<Stream<FileSlice>> streamSupplier = () -> super.getFileSlicesEligibleForClustering(partition)

Review Comment:
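   Thank you for your feedback! We need the Supplier because the stream is
consumed twice: once to count the eligible slices and once to return them.
Java streams are single-use, so running a second terminal operation on the
same Stream instance throws an IllegalStateException. Removing the Supplier
without another change would therefore break the logic.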
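
To illustrate the single-use constraint outside of Hudi, here is a minimal sketch on plain integers. The class and method names are hypothetical and only serve the example; the filter is a stand-in for the small-file check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of the PR.
class SingleUseStreamSketch {

  // Returns true if a second terminal operation on the same Stream fails.
  static boolean reuseThrows(List<Integer> values) {
    Stream<Integer> stream = values.stream().filter(v -> v % 2 == 0);
    stream.count(); // first terminal operation consumes the stream
    try {
      stream.count(); // second terminal operation on the same instance
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  // A Supplier builds a fresh pipeline on every get(), so the source
  // can be traversed twice without error.
  static long[] traverseTwice(List<Integer> values) {
    Supplier<Stream<Integer>> supplier =
        () -> values.stream().filter(v -> v % 2 == 0);
    long first = supplier.get().count();
    long second = supplier.get().count();
    return new long[] {first, second};
  }

  public static void main(String[] args) {
    List<Integer> values = Arrays.asList(1, 2, 3, 4);
    System.out.println("reuse throws: " + reuseThrows(values));
    System.out.println("counts: " + Arrays.toString(traverseTwice(values)));
  }
}
```

The cost of the Supplier is that the filter runs once per traversal; nothing is buffered between the two passes.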
   We could collect the stream into a List to avoid the Supplier, like this:
   
   `protected Stream<FileSlice> getFileSlicesEligibleForClustering(final String partition) {
     List<FileSlice> slices = super.getFileSlicesEligibleForClustering(partition)
             .filter(slice -> slice.getBaseFile().map(HoodieBaseFile::getFileSize).orElse(0L)
                     < getWriteConfig().getClusteringSmallFileLimit())
             .collect(Collectors.toList());
   
     if (!StringUtils.isNullOrEmpty(getWriteConfig().getClusteringSortColumns())) {
         return slices.stream();
     }
   
     if (slices.size() > 1) {
         return slices.stream();
     }
     return Stream.empty();
   }`
   
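   This uses more memory, because every eligible slice is materialized in a
List before it is streamed again. If the number of eligible slices is small,
that overhead is negligible and we can remove the Supplier. Do you prefer
keeping the Supplier for efficiency or switching to the List approach?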
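
For comparison, here is a rough side-by-side sketch of the two options on plain `Long` file sizes. It assumes the Supplier variant counts the eligible entries and then returns a fresh stream, as described above; `SMALL_FILE_LIMIT`, `hasSortColumns`, and the class name are hypothetical stand-ins, not Hudi APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical comparison class; not part of the PR.
class ClusteringFilterComparison {

  // Stand-in for getWriteConfig().getClusteringSmallFileLimit().
  static final long SMALL_FILE_LIMIT = 100L;

  // Supplier variant: the filter runs twice, but nothing is buffered.
  static Stream<Long> withSupplier(List<Long> sizes, boolean hasSortColumns) {
    Supplier<Stream<Long>> supplier =
        () -> sizes.stream().filter(size -> size < SMALL_FILE_LIMIT);
    // limit(2) lets the counting pass stop after the second match.
    if (hasSortColumns || supplier.get().limit(2).count() > 1) {
      return supplier.get();
    }
    return Stream.empty();
  }

  // List variant: the filter runs once, and all matches are held in memory.
  static Stream<Long> withList(List<Long> sizes, boolean hasSortColumns) {
    List<Long> small = sizes.stream()
        .filter(size -> size < SMALL_FILE_LIMIT)
        .collect(Collectors.toList());
    if (hasSortColumns || small.size() > 1) {
      return small.stream();
    }
    return Stream.empty();
  }

  public static void main(String[] args) {
    List<Long> sizes = Arrays.asList(10L, 20L, 500L);
    System.out.println(withSupplier(sizes, false).collect(Collectors.toList()));
    System.out.println(withList(sizes, false).collect(Collectors.toList()));
  }
}
```

Both variants return the same elements in this sketch; they differ only in whether the filter is re-evaluated or its results are buffered.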



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
