zhangyue19921010 commented on a change in pull request #4078:
URL: https://github.com/apache/hudi/pull/4078#discussion_r784462454
##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
##########
@@ -248,11 +253,32 @@ private HoodieInstant readCommit(GenericRecord record, boolean loadDetails) {
              break;
            }
          }
+        } catch (Exception originalException) {
+          // Merging small archive files may leave an uncompleted archive file, which will cause an exception here.
+          // We need to ignore this kind of exception.
+          try {
+            Path planPath = new Path(metaClient.getArchivePath(), "mergeArchivePlan");
+            HoodieWrapperFileSystem fileSystem = metaClient.getFs();
+            if (fileSystem.exists(planPath)) {
+              HoodieMergeArchiveFilePlan plan = TimelineMetadataUtils.deserializeAvroMetadata(FileIOUtils.readDataFromPath(fileSystem, planPath).get(), HoodieMergeArchiveFilePlan.class);
+              String mergedArchiveFileName = plan.getMergedArchiveFileName();
+              if (!StringUtils.isNullOrEmpty(mergedArchiveFileName) && fs.getPath().getName().equalsIgnoreCase(mergedArchiveFileName)) {
+                LOG.warn("Catch exception because of reading uncompleted merging archive file " + mergedArchiveFileName + ". Ignore it here.");
+                continue;
+              }
+            }
+            throw originalException;
+          } catch (Exception e) {
+            // If anything goes wrong while parsing the merge archive plan, we need to throw the original exception.
+            // For example, the archive file and the plan may both be corrupted.
+            throw originalException;
+          }
Review comment:
Hi @nsivabalan and @yihua
The common concern is the incomplete/duplicate data left behind when the last merge of small archive files fails and the current Hudi writer/commit is configured with archive-file merging disabled.
Ideally we need to check and clean this dirty data before every archive.
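To make that concrete, here is a minimal sketch of such a pre-archive cleanup, assuming the `mergeArchivePlan` file from the diff above marks a failed merge; the class and method names, and the idea of simply deleting the half-written merged file, are illustrative only, not the actual Hudi implementation:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch only: before archiving, check whether a previous merge of small
 * archive files died halfway (its plan file is still present) and, if so, remove the
 * possibly incomplete merged archive file it was producing.
 */
public class MergeArchiveCleanupSketch {

  private final FileSystem fs;
  private final Path archivePath;

  public MergeArchiveCleanupSketch(FileSystem fs, Path archivePath) {
    this.fs = fs;
    this.archivePath = archivePath;
  }

  public void cleanUpFailedMergeIfAny(String mergedArchiveFileName) throws IOException {
    Path planPath = new Path(archivePath, "mergeArchivePlan");
    if (!fs.exists(planPath)) {
      return; // no leftover plan, so the last merge (if any) completed cleanly
    }
    Path halfWrittenFile = new Path(archivePath, mergedArchiveFileName);
    if (fs.exists(halfWrittenFile)) {
      fs.delete(halfWrittenFile, false); // drop the incomplete merged archive file
    }
    fs.delete(planPath, false); // remove the plan last, so a crash here is still detectable
  }
}
```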
The reasons I think we need this switch before doing the cleanup work are:
1. This is a new feature, so it is safer to keep it behind a default-false control here.
2. I am pretty worried about multi-writer here; at least this gives us a way to ensure only one writer does the merge work.
As for making sure the incomplete data does no damage to loading the archived timeline until the next cleanup:
1. We use a HashSet to avoid duplicate instants while loading archived instants (a small illustration follows below).
2. We use this try-catch to handle the exception caused by loading an incomplete merged small-archive file.
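For point 1, a simplified illustration of the HashSet-based de-duplication (the real loader works on Avro records parsed from the archived log files; plain strings are used here just to show the idea):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Simplified illustration: instants that appear both in the old small archive files and
 * in a partially merged file are collapsed, because HashSet#add is a no-op for duplicates.
 */
public class InstantDedupSketch {

  public static List<String> loadDistinctInstants(List<String> instantsFromAllArchiveFiles) {
    Set<String> seen = new HashSet<>();
    List<String> distinct = new ArrayList<>();
    for (String instant : instantsFromAllArchiveFiles) {
      if (seen.add(instant)) { // returns false for a duplicate, so it is skipped
        distinct.add(instant);
      }
    }
    return distinct;
  }
}
```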
As a next step, maybe we can take care of multi-writer, let this run stably for some time in my staging/production environments, and finally remove the strict restriction of `verifyLastMergeArchiveFilesIfNecessary` here :)