codope commented on a change in pull request #3833:
URL: https://github.com/apache/hudi/pull/3833#discussion_r740188385



##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/SparkSizeBasedClusteringPlanStrategy.java
##########
@@ -71,24 +71,30 @@ public 
SparkSizeBasedClusteringPlanStrategy(HoodieSparkMergeOnReadTable<T> table
     List<Pair<List<FileSlice>, Integer>> fileSliceGroups = new ArrayList<>();
     List<FileSlice> currentGroup = new ArrayList<>();
     long totalSizeSoFar = 0;
+    HoodieWriteConfig writeConfig = getWriteConfig();
     for (FileSlice currentSlice : fileSlices) {
       // assume each filegroup size is ~= parquet.max.file.size
-      totalSizeSoFar += currentSlice.getBaseFile().isPresent() ? 
currentSlice.getBaseFile().get().getFileSize() : 
getWriteConfig().getParquetMaxFileSize();
+      totalSizeSoFar += currentSlice.getBaseFile().isPresent() ? 
currentSlice.getBaseFile().get().getFileSize() : 
writeConfig.getParquetMaxFileSize();
       // check if max size is reached and create new group, if needed.
-      if (totalSizeSoFar >= getWriteConfig().getClusteringMaxBytesInGroup() && 
!currentGroup.isEmpty()) {
-        int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar, 
getWriteConfig().getClusteringTargetFileMaxBytes());
+      if (totalSizeSoFar >= writeConfig.getClusteringMaxBytesInGroup() && 
!currentGroup.isEmpty()) {
+        int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar, 
writeConfig.getClusteringTargetFileMaxBytes());
         LOG.info("Adding one clustering group " + totalSizeSoFar + " max 
bytes: "
-            + getWriteConfig().getClusteringMaxBytesInGroup() + " num input 
slices: " + currentGroup.size() + " output groups: " + numOutputGroups);
+            + writeConfig.getClusteringMaxBytesInGroup() + " num input slices: 
" + currentGroup.size() + " output groups: " + numOutputGroups);
         fileSliceGroups.add(Pair.of(currentGroup, numOutputGroups));
         currentGroup = new ArrayList<>();
         totalSizeSoFar = 0;
       }
       currentGroup.add(currentSlice);
+      // totalSizeSoFar could be 0 when new group was created in the previous 
conditional block.
+      // reset to the size of current slice, otherwise the number of output 
file group will become 0 even though current slice is present.
+      if (totalSizeSoFar == 0) {

Review comment:
       You're right that totalSizeSoFar is set to 0 only when a group quota is 
complete in conditional block. However, notice that at #L87 current group is 
populated but totalSizeSoFar still remains 0. We need to set the totalSizeSoFar 
to reflect the current size otherwise the numOutputGroups at #L95 will be 0 
even though the current group is not empty.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to