[
https://issues.apache.org/jira/browse/HBASE-28462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831668#comment-17831668
]
Dieter De Paepe commented on HBASE-28462:
-----------------------------------------
Hi [~bbeaudreault], could this be related to HBASE-28461?
> Incremental backup can fail if log gets archived while WALPlayer is starting
> up
> -------------------------------------------------------------------------------
>
> Key: HBASE-28462
> URL: https://issues.apache.org/jira/browse/HBASE-28462
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Bryan Beaudreault
> Priority: Major
>
> We had an incremental backup fail with FileNotFoundException for a file in the
> WALs directory. Upon investigation, the log had been archived a few minutes
> earlier. WALInputFormat's record reader has support for falling back on an
> archived path:
> {code:java}
> } catch (IOException e) {
>   Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
>   // archivedLog can be null if unable to locate in archiveDir.
>   if (archivedLog != null) {
>     openReader(archivedLog);
>     // Try call again in recursion
>     return nextKeyValue();
>   } else {
>     throw e;
>   }
> } {code}
> But the getSplits method has different handling:
> {code:java}
> try {
>   List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
>   allFiles.addAll(files);
> } catch (FileNotFoundException e) {
>   if (ignoreMissing) {
>     LOG.warn("File " + inputPath + " is missing. Skipping it.");
>     continue;
>   }
>   throw e;
> } {code}
> This ignoreMissing variable was added in HBASE-14141 and is enabled via
> wal.input.ignore.missing.files, which defaults to false and is never set.
> Looking at the comment and reviewboard history of HBASE-14141, I think there
> might have been some confusion about where to handle these missing files, and
> this got lost in the shuffle.
>
> I would prefer not to ignore missing WAL files. I think that could result in
> some weird behavior:
> * RegionServer has 10 archived and 30 not-yet-archived WALs needing to be
> backed up
> * The process starts, and while it's running 1 of those 30 WALs gets
> archived. That would get skipped due to FileNotFoundException
> * But the remaining 29 would be backed up
> This scenario could cause some data consistency issues if this incremental
> backup is restored: we would be missing some edits in the middle of applied
> edits from other WALs.
> So I do think failing as we do today is necessary for consistency, but
> unrealistic in a live cluster. The solution is to try finding the missing
> file in the archive directory. Backup has a coprocessor which will not
> allow the archived file to be cleaned up until it's backed up, so I think
> it's safe to say that a WAL is definitely in either WALs or oldWALs.
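
For illustration only, here is a minimal sketch of the "look in the archive before failing" idea from the description. It uses plain java.nio on a local temp directory rather than HBase's FileSystem API, and the class name, the resolveWal helper, and the flat WALs/oldWALs layout are all hypothetical simplifications, not the actual patch or the behavior of AbstractFSWALProvider.findArchivedLog.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ArchivedWalFallback {

    // Hypothetical helper: return the live path if it still exists, otherwise
    // the file with the same name under archiveDir. If the file is in neither
    // place, fail rather than skip it, so the backup cannot silently lose edits.
    static Path resolveWal(Path walFile, Path archiveDir) throws FileNotFoundException {
        if (Files.exists(walFile)) {
            return walFile;
        }
        Path archived = archiveDir.resolve(walFile.getFileName());
        if (Files.exists(archived)) {
            return archived;
        }
        throw new FileNotFoundException("WAL not found in WALs or oldWALs: " + walFile);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("wal-demo");
        Path wals = Files.createDirectory(root.resolve("WALs"));
        Path oldWals = Files.createDirectory(root.resolve("oldWALs"));

        // The WAL is listed while it is still live...
        Path live = Files.createFile(wals.resolve("wal.1"));
        // ...and is archived before the job gets around to opening it.
        Files.move(live, oldWals.resolve("wal.1"));

        // Prints "oldWALs": the stale live path resolves to the archived copy.
        System.out.println(resolveWal(live, oldWals).getParent().getFileName());
    }
}
```

A check like this only narrows the window: the file can still be archived between the existence check and the open, so the record reader's catch-and-retry shown in the description would still be needed.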