nsivabalan commented on a change in pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#discussion_r807373205



##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
##########
@@ -73,7 +73,8 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) {
       CleanPlanner<T, I, K, O> planner = new CleanPlanner<>(context, table, 
config);
       Option<HoodieInstant> earliestInstant = 
planner.getEarliestCommitToRetain();
       context.setJobStatus(this.getClass().getSimpleName(), "Obtaining list of 
partitions to be cleaned");
-      List<String> partitionsToClean = 
planner.getPartitionPathsToClean(earliestInstant);
+      boolean isDropPartition = planner.isDropPartition(earliestInstant);

Review comment:
       guess we have to rethink this a bit. We have diff clean policies and I 
am afraid this may not work for other policies. for eg, default is 
KEEP_LATEST_COMMITS and it might work, but for KEEP_LATEST_FILE_VERSIONS, 
earliestInstant is empty. 

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -400,6 +416,16 @@ private String 
getLatestVersionBeforeCommit(List<FileSlice> fileSliceList, Hoodi
     return deletePaths;
   }
 
+  public List<CleanFileInfo> getDeletePartitionPaths(String partitionPath) {

Review comment:
       so, in case of DeletePartition, we just include the partition path 
(directory) to CleanFileInfo and do not add explicit file groups/files within 
partition is it? 

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
##########
@@ -85,10 +86,18 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) 
{
 
       context.setJobStatus(this.getClass().getSimpleName(), "Generating list 
of file slices to be cleaned");
 
-      Map<String, List<HoodieCleanFileInfo>> cleanOps = context
-          .map(partitionsToClean, partitionPathToClean -> 
Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)), 
cleanerParallelism)
-          .stream()
-          .collect(Collectors.toMap(Pair::getKey, y -> 
CleanerUtils.convertToHoodieCleanFileInfoList(y.getValue())));
+      Map<String, List<HoodieCleanFileInfo>> cleanOps;
+      if (isDropPartition) {
+        cleanOps = context
+            .map(partitionsToClean, partitionPathToClean -> 
Pair.of(partitionPathToClean, 
planner.getDeletePartitionPaths(partitionPathToClean)), cleanerParallelism)

Review comment:
       shouldn't we atleast fix CleanFileInfo to have another variable called 
isPartitionPath or something. 
   

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -113,21 +117,29 @@ public CleanPlanner(HoodieEngineContext context, 
HoodieTable<T, I, K, O> hoodieT
     return metadata.getPartitionMetadata().values().stream().flatMap(s -> 
s.getSavepointDataFile().stream());
   }
 
+  public List<String> getPartitionPathsToClean(Option<HoodieInstant> 
earliestRetainedInstant) throws IOException {

Review comment:
       is this used anywhere? 

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
##########
@@ -85,10 +86,18 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) 
{
 
       context.setJobStatus(this.getClass().getSimpleName(), "Generating list 
of file slices to be cleaned");
 
-      Map<String, List<HoodieCleanFileInfo>> cleanOps = context
-          .map(partitionsToClean, partitionPathToClean -> 
Pair.of(partitionPathToClean, planner.getDeletePaths(partitionPathToClean)), 
cleanerParallelism)
-          .stream()
-          .collect(Collectors.toMap(Pair::getKey, y -> 
CleanerUtils.convertToHoodieCleanFileInfoList(y.getValue())));
+      Map<String, List<HoodieCleanFileInfo>> cleanOps;
+      if (isDropPartition) {
+        cleanOps = context
+            .map(partitionsToClean, partitionPathToClean -> 
Pair.of(partitionPathToClean, 
planner.getDeletePartitionPaths(partitionPathToClean)), cleanerParallelism)

Review comment:
       from what I infer, we are just returning the partition path directory as 
part of List<HoodieCleanFileInfo> here and don't explicitly add every file 
group within the partition. since we store entries in metadata table keyed on 
partition path, we will just directly delete the entry and so add every file 
group info is not required? can you confirm my understanding. 
   

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -432,4 +458,47 @@ private boolean 
isFileSliceNeededForPendingCompaction(FileSlice fileSlice) {
   private boolean isFileGroupInPendingCompaction(HoodieFileGroup fg) {
     return fgIdToPendingCompactionOperations.containsKey(fg.getFileGroupId());
   }
+
+  public boolean isDropPartition(Option<HoodieInstant> instantToRetain) {

Review comment:
       isDeletePartitionOperation

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -432,4 +458,47 @@ private boolean 
isFileSliceNeededForPendingCompaction(FileSlice fileSlice) {
   private boolean isFileGroupInPendingCompaction(HoodieFileGroup fg) {
     return fgIdToPendingCompactionOperations.containsKey(fg.getFileGroupId());
   }
+
+  public boolean isDropPartition(Option<HoodieInstant> instantToRetain) {
+    try {
+      if (instantToRetain.isPresent() && 
HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instantToRetain.get().getAction())) 
{
+        HoodieReplaceCommitMetadata replaceCommitMetadata = 
HoodieReplaceCommitMetadata.fromBytes(
+            
hoodieTable.getActiveTimeline().getInstantDetails(instantToRetain.get()).get(), 
HoodieReplaceCommitMetadata.class);
+
+        if (replaceCommitMetadata != null
+            && 
WriteOperationType.DELETE_PARTITION.equals(replaceCommitMetadata.getOperationType()))
 {
+          return true;
+        }
+      }
+    } catch (Exception e) {
+      throw new HoodieException("Failed to get commit metadata", e);
+    }
+    return false;
+  }
+
+  public List<String> getDropPartitions(Option<HoodieInstant> instantToRetain) 
{
+    try {
+      if (!instantToRetain.isPresent()) {
+        LOG.info("No earliest commit to retain. No need to scan partitions 
!!");
+        return Collections.emptyList();
+      }
+
+      if 
(HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instantToRetain.get().getAction()))
 {

Review comment:
       I am not sure if we can map earliest instant to retain to clean up 
partitions to delete. 
   
   for eg:
   C1, C2(delete partition), C3, C4. 
   
   lets say after C4, earliest commit to retain is C2: this essentially means 
that cleaner has to clean all data files for all commits < C2. and not touch 
anything pertaining to C2. but here wrt getDropPartition, we are breaking that 
and triggering deletion of partitions from C2. 
   
   or am I missing something? 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to