[GitHub] [hudi] xushiyan commented on a change in pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

GitBox Sat, 12 Mar 2022 03:28:47 -0800


xushiyan commented on a change in pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#discussion_r825286337




##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkCopyOnWriteTable.java
##########
@@ -18,6 +18,8 @@
 
 package org.apache.hudi.table;
 
+import org.apache.avro.Schema;

Review comment:
       there is no other code change in this file. should avoid re-ordering 
imports here

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -472,4 +499,21 @@ private boolean 
isFileSliceNeededForPendingCompaction(FileSlice fileSlice) {
   private boolean isFileGroupInPendingCompaction(HoodieFileGroup fg) {
     return fgIdToPendingCompactionOperations.containsKey(fg.getFileGroupId());
   }
+
+  public boolean isDeletePartitionOperation(Option<HoodieInstant> 
instantToRetain) {

Review comment:
       it actually looks better to be in some timeline utils class instead of 
in `CleanPlanner`

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -206,7 +212,13 @@ public CleanPlanner(HoodieEngineContext context, 
HoodieTable<T, I, K, O> hoodieT
    */
   private List<String> getPartitionPathsForFullCleaning() {
     // Go to brute force mode of scanning all partitions
-    return FSUtils.getAllPartitionPaths(context, config.getMetadataConfig(), 
config.getBasePath());
+    try {
+      FileSystemBackedTableMetadata fsBackedTableMetadata = new 
FileSystemBackedTableMetadata(context,
+          context.getHadoopConf(), config.getBasePath(), 
config.shouldAssumeDatePartitioning());
+      return fsBackedTableMetadata.getAllPartitionPaths();
+    } catch (IOException e) {
+      return Collections.emptyList();

Review comment:
       can you clarify why we have to use the concrete implementation 
`FileSystemBackedTableMetadata` here? the existing code uses its interface 
`HoodieTableMetadata` as an abstraction.

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -159,12 +163,14 @@ public CleanPlanner(HoodieEngineContext context, 
HoodieTable<T, I, K, O> hoodieT
                   
.deserializeHoodieCleanMetadata(hoodieTable.getActiveTimeline().getInstantDetails(lastClean.get()).get());
           if ((cleanMetadata.getEarliestCommitToRetain() != null)
                   && (cleanMetadata.getEarliestCommitToRetain().length() > 0)) 
{
-            return getPartitionPathsForIncrementalCleaning(cleanMetadata, 
instantToRetain);
+            return getPartitionPathsForIncrementalCleaning(cleanMetadata, 
instantToRetain).stream()
+                .distinct().collect(Collectors.toList());
           }
         }
       }
     }
-    return getPartitionPathsForFullCleaning();
+
+    return 
getPartitionPathsForFullCleaning().stream().distinct().collect(Collectors.toList());

Review comment:
       why we need the deduplication here? 
`getPartitionPathsForIncrementalCleaning` already calls `distinct()` plus for 
both methods, it shouldn't be the caller's responsibility to deduplicate; the 
APIs should take care of it internally

##########
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -157,10 +157,13 @@ public static void deleteMetadataTable(String basePath, 
HoodieEngineContext cont
                                                                           
String instantTime) {
     List<HoodieRecord> records = new 
ArrayList<>(commitMetadata.getPartitionToWriteStats().size());
 
-    // Add record bearing partitions list
-    ArrayList<String> partitionsList = new 
ArrayList<>(commitMetadata.getPartitionToWriteStats().keySet());
+    // Add record bearing added partitions list
+    ArrayList<String> partitionsAdded = new 
ArrayList<>(commitMetadata.getPartitionToWriteStats().keySet());
+
+    // Add record bearing deleted partitions list
+    ArrayList<String> partitionsDeleted = getPartitionsDeleted(commitMetadata);

Review comment:
       should just declare as `List<String>`

##########
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
##########
@@ -542,8 +542,6 @@ private void transitionState(HoodieInstant fromInstant, 
HoodieInstant toInstant,
       } else {
         // Ensures old state exists in timeline
         LOG.info("Checking for file exists ?" + new 
Path(metaClient.getMetaPath(), fromInstant.getFileName()));
-        ValidationUtils.checkArgument(metaClient.getFs().exists(new 
Path(metaClient.getMetaPath(),
-            fromInstant.getFileName())));

Review comment:
       why this not needed any more?

##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
##########
@@ -472,4 +499,21 @@ private boolean 
isFileSliceNeededForPendingCompaction(FileSlice fileSlice) {
   private boolean isFileGroupInPendingCompaction(HoodieFileGroup fg) {
     return fgIdToPendingCompactionOperations.containsKey(fg.getFileGroupId());
   }
+
+  public boolean isDeletePartitionOperation(Option<HoodieInstant> 
instantToRetain) {

Review comment:
       can we make it private if not intend to use else where? also prefer to 
have a static helper for this, instead of couple it with CleanPlanner




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] xushiyan commented on a change in pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

Reply via email to