zhangyue19921010 commented on a change in pull request #4078:
URL: https://github.com/apache/hudi/pull/4078#discussion_r784462454
##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
##########
@@ -248,11 +253,32 @@ private HoodieInstant readCommit(GenericRecord record, boolean loadDetails) {
              break;
            }
          }
+        } catch (Exception originalException) {
+          // Merging small archive files may leave an uncompleted archive file, which will cause an exception here.
+          // We need to ignore this kind of exception.
+          try {
+            Path planPath = new Path(metaClient.getArchivePath(), "mergeArchivePlan");
+            HoodieWrapperFileSystem fileSystem = metaClient.getFs();
+            if (fileSystem.exists(planPath)) {
+              HoodieMergeArchiveFilePlan plan = TimelineMetadataUtils.deserializeAvroMetadata(FileIOUtils.readDataFromPath(fileSystem, planPath).get(), HoodieMergeArchiveFilePlan.class);
+              String mergedArchiveFileName = plan.getMergedArchiveFileName();
+              if (!StringUtils.isNullOrEmpty(mergedArchiveFileName) && fs.getPath().getName().equalsIgnoreCase(mergedArchiveFileName)) {
+                LOG.warn("Catch exception because of reading uncompleted merging archive file " + mergedArchiveFileName + ". Ignore it here.");
+                continue;
+              }
+            }
+            throw originalException;
+          } catch (Exception e) {
+            // If anything goes wrong while parsing the merge archive plan, we need to throw the original exception.
+            // For example, the archive file and the plan may both be corrupted.
+            throw originalException;
+          }
Review comment:
Hi @nsivabalan and @yihua
The common concern is the incomplete/duplicate data left behind when the last merge of small archive files fails and the current Hudi writer/commit is configured with archive-file merging disabled.
Ideally we need to check and clean this dirty data before every archive.
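To make that concrete, here is a minimal sketch of such a pre-archive cleanup, assuming the `mergeArchivePlan` file from the diff above marks a failed merge; the class and method names, and the idea of simply deleting the half-written merged file, are illustrative only, not the actual Hudi implementation:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch only: before archiving, check whether a previous merge of small
 * archive files died halfway (its plan file is still present) and, if so, remove the
 * possibly incomplete merged archive file it was producing.
 */
public class MergeArchiveCleanupSketch {

  private final FileSystem fs;
  private final Path archivePath;

  public MergeArchiveCleanupSketch(FileSystem fs, Path archivePath) {
    this.fs = fs;
    this.archivePath = archivePath;
  }

  public void cleanUpFailedMergeIfAny(String mergedArchiveFileName) throws IOException {
    Path planPath = new Path(archivePath, "mergeArchivePlan");
    if (!fs.exists(planPath)) {
      return; // no leftover plan, so the last merge (if any) completed cleanly
    }
    Path halfWrittenFile = new Path(archivePath, mergedArchiveFileName);
    if (fs.exists(halfWrittenFile)) {
      fs.delete(halfWrittenFile, false); // drop the incomplete merged archive file
    }
    fs.delete(planPath, false); // remove the plan last, so a crash here is still detectable
  }
}
```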
The reasons I think we need this switch before doing the cleanup work are:
1. This is a new feature, so it is safer to keep it behind a default-false control here.
2. I am pretty worried about multi-writer here; at least this gives us a way to ensure only one writer does the merge work.
As for making sure the incomplete data does no damage to loading the archived timeline until the next cleanup:
1. We use a HashSet to avoid duplicate instants while loading archived instants (a small illustration follows below).
2. We use this try-catch to handle the exception caused by loading an incomplete merged small-archive file.
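For point 1, a simplified illustration of the HashSet-based de-duplication (the real loader works on Avro records parsed from the archived log files; plain strings are used here just to show the idea):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Simplified illustration: instants that appear both in the old small archive files and
 * in a partially merged file are collapsed, because HashSet#add is a no-op for duplicates.
 */
public class InstantDedupSketch {

  public static List<String> loadDistinctInstants(List<String> instantsFromAllArchiveFiles) {
    Set<String> seen = new HashSet<>();
    List<String> distinct = new ArrayList<>();
    for (String instant : instantsFromAllArchiveFiles) {
      if (seen.add(instant)) { // returns false for a duplicate, so it is skipped
        distinct.add(instant);
      }
    }
    return distinct;
  }
}
```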
As a next step, maybe we can take care of multi-writer, let this run stably for some time in my staging/production environments, and finally remove the strict restriction of `verifyLastMergeArchiveFilesIfNecessary` here :)