codope commented on code in PR #6581:
URL: https://github.com/apache/hudi/pull/6581#discussion_r1002393516


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -148,13 +151,26 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .key(PLAN_PARTITION_FILTER_MODE)
       .defaultValue(ClusteringPlanPartitionFilterMode.NONE)
       .sinceVersion("0.11.0")
-      .withDocumentation("Partition filter mode used in the creation of clustering plan. Available values are - "
-          + "NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate."
-          + "RECENT_DAYS: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"
+      .withDocumentation("Partition Filter mode used in the creation of clustering plan. Available values are - "
+          + "NONE: do not filter anything and thus the clustering plan will include all file slices that have clustering candidate."
+          + "RECENT: keep a continuous range of partitions, worked together with configs '" + DAYBASED_LOOKBACK_PARTITIONS.key() + "' and '"

Review Comment:
   this is still `RECENT_DAYS` right?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/strategy/PartitionAwareClusteringPlanStrategy.java:
##########
@@ -83,12 +100,37 @@ public Option<HoodieClusteringPlan> generateClusteringPlan() {
       return Option.empty();
     }
 
+    Option<HoodieDefaultTimeline> toFilterTimeline = Option.empty();
+    Option<HoodieInstant> latestCompletedInstant = Option.empty();
+    // if filtering based on file slices are enabled, we need to pass the toFilter timeline below to assist in filtering.
+    if (config.getClusteringPlanFilterMode() != ClusteringPlanFilterMode.NONE) {
+      Option<HoodieInstant> latestScheduledReplaceCommit = metaClient.getActiveTimeline()
+          .filter(instant -> instant.getAction().equals(HoodieTimeline.REPLACE_COMMIT_ACTION))
+          .filter(instant -> instant.isRequested()).lastInstant();
+      if (latestScheduledReplaceCommit.isPresent()) {
+        HoodieClusteringPlan clusteringPlan = ClusteringUtils.getClusteringPlan(
+                metaClient, HoodieTimeline.getReplaceCommitRequestedInstant(latestScheduledReplaceCommit.get().getTimestamp()))
+            .map(Pair::getRight).orElseThrow(() -> new HoodieClusteringException(
+                "Unable to read clustering plan for instant: " + latestScheduledReplaceCommit.get().getTimestamp()));
+        if (!StringUtils.isNullOrEmpty(clusteringPlan.getLatestCompletedInstant())) {
+          toFilterTimeline = Option.of((HoodieDefaultTimeline)
+              metaClient.getActiveTimeline().filter(instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.GREATER_THAN, clusteringPlan.getLatestCompletedInstant())));
+        }
+      } else {
+        // if last clustering was archived
+        toFilterTimeline = Option.of(metaClient.getActiveTimeline());
+      }
+      latestCompletedInstant = metaClient.getActiveTimeline().filterCompletedInstants().lastInstant();
+    }
+
+    Option<HoodieDefaultTimeline> finalToFilterTimeline = toFilterTimeline;
     List<HoodieClusteringGroup> clusteringGroups = getEngineContext()
         .flatMap(
             partitionPaths,
             partitionPath -> {
               List<FileSlice> fileSlicesEligible = getFileSlicesEligibleForClustering(partitionPath).collect(Collectors.toList());
-              return buildClusteringGroupsForPartition(partitionPath, fileSlicesEligible).limit(getWriteConfig().getClusteringMaxNumGroups());
+              List<FileSlice> filteredFileSlices = filterFileSlices(fileSlicesEligible, finalToFilterTimeline, config.getClusteringPlanFilterMode());

Review Comment:
   this follows partition filtering.. should we disable one when the other is enabled? Otherwise it creates complicated combinations, e.g. let's say users want certain partitions to be clustered using the SELECTED_PARTITIONS strategy no matter whether they were part of recent commits or not?
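
   To make the concern concrete, one option would be to validate the two modes up front and fail fast when both are active. The sketch below is purely illustrative and not code from this PR; the enum names are simplified stand-ins for Hudi's actual clustering enums.

   ```java
   // Hypothetical sketch: reject configurations where both filter dimensions
   // are active, so e.g. SELECTED_PARTITIONS is never silently narrowed
   // further by a commit-recency file-slice filter.
   class ClusteringFilterModeGuard {

     // Simplified stand-ins for the modes discussed in this PR.
     enum PartitionFilterMode { NONE, RECENT_DAYS, SELECTED_PARTITIONS }
     enum FileSliceFilterMode { NONE, RECENT_COMMITS }

     // Returns true only when at most one of the two filters is enabled.
     static boolean isValidCombination(PartitionFilterMode partitionMode, FileSliceFilterMode sliceMode) {
       return partitionMode == PartitionFilterMode.NONE || sliceMode == FileSliceFilterMode.NONE;
     }
   }
   ```

   A plan strategy could call such a guard before building clustering groups and throw a configuration error on an invalid combination.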



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -61,6 +62,8 @@ public class HoodieClusteringConfig extends HoodieConfig {
       "org.apache.hudi.client.clustering.run.strategy.JavaSortAndSizeExecutionStrategy";
   public static final String PLAN_PARTITION_FILTER_MODE =
       "hoodie.clustering.plan.partition.filter.mode";
+  public static final String PLAN_FILTER_MODE =

Review Comment:
   Is additional config really necessary? Can we fold this into the existing 
config? Though I like the name of the new config, both configs serve a common 
purpose of filtering (in different ways), and lesser the configs the better. If 
we have to define a new config, then let's rename it to  
`hoodie.clustering.plan.filegroup.filter.mode` just to be more explicit and 
differentiate from the existing partition filter config.
   


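To illustrate the fold-into-one-config idea: a single enum-valued key could cover both the existing partition modes and the new file-group recency mode. The sketch below is hypothetical; the enum values and parse helper are illustrative, not actual Hudi config values.

```java
// Hypothetical sketch: one enum behind a single config key, combining the
// existing partition filter modes with the proposed commit-recency mode.
class CombinedPlanFilterMode {

  enum PlanFilterMode { NONE, RECENT_DAYS, SELECTED_PARTITIONS, RECENT_COMMITS }

  // Parse a config string leniently, mirroring typical enum-valued configs.
  static PlanFilterMode parse(String value) {
    return PlanFilterMode.valueOf(value.trim().toUpperCase());
  }
}
```

The trade-off is that a combined enum cannot express "both filters at once", which may actually be desirable given the combination concerns raised above.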

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
