codope commented on a change in pull request #3833:
URL: https://github.com/apache/hudi/pull/3833#discussion_r740188385
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/plan/strategy/SparkSizeBasedClusteringPlanStrategy.java
##########
@@ -71,24 +71,30 @@ public
SparkSizeBasedClusteringPlanStrategy(HoodieSparkMergeOnReadTable<T> table
List<Pair<List<FileSlice>, Integer>> fileSliceGroups = new ArrayList<>();
List<FileSlice> currentGroup = new ArrayList<>();
long totalSizeSoFar = 0;
+ HoodieWriteConfig writeConfig = getWriteConfig();
for (FileSlice currentSlice : fileSlices) {
// assume each filegroup size is ~= parquet.max.file.size
- totalSizeSoFar += currentSlice.getBaseFile().isPresent() ?
currentSlice.getBaseFile().get().getFileSize() :
getWriteConfig().getParquetMaxFileSize();
+ totalSizeSoFar += currentSlice.getBaseFile().isPresent() ?
currentSlice.getBaseFile().get().getFileSize() :
writeConfig.getParquetMaxFileSize();
// check if max size is reached and create new group, if needed.
- if (totalSizeSoFar >= getWriteConfig().getClusteringMaxBytesInGroup() &&
!currentGroup.isEmpty()) {
- int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar,
getWriteConfig().getClusteringTargetFileMaxBytes());
+ if (totalSizeSoFar >= writeConfig.getClusteringMaxBytesInGroup() &&
!currentGroup.isEmpty()) {
+ int numOutputGroups = getNumberOfOutputFileGroups(totalSizeSoFar,
writeConfig.getClusteringTargetFileMaxBytes());
LOG.info("Adding one clustering group " + totalSizeSoFar + " max
bytes: "
- + getWriteConfig().getClusteringMaxBytesInGroup() + " num input
slices: " + currentGroup.size() + " output groups: " + numOutputGroups);
+ + writeConfig.getClusteringMaxBytesInGroup() + " num input slices:
" + currentGroup.size() + " output groups: " + numOutputGroups);
fileSliceGroups.add(Pair.of(currentGroup, numOutputGroups));
currentGroup = new ArrayList<>();
totalSizeSoFar = 0;
}
currentGroup.add(currentSlice);
+ // totalSizeSoFar could be 0 when new group was created in the previous
conditional block.
+ // reset to the size of current slice, otherwise the number of output
file group will become 0 even though current slice is present.
+ if (totalSizeSoFar == 0) {
Review comment:
You're right that totalSizeSoFar is set to 0 only when a group quota is
complete in conditional block. However, notice that at #L87 current group is
populated but totalSizeSoFar still remains 0. We need to set the totalSizeSoFar
to reflect the current size otherwise the numOutputGroups at #L95 will be 0
even though the current group is not empty.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]