[ https://issues.apache.org/jira/browse/HBASE-28462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848538#comment-17848538 ]
Nick Dimiduk commented on HBASE-28462:
--------------------------------------

I think the snapshot manager should do something similar as well (HBASE-19681).

> Incremental backup can fail if log gets archived while WALPlayer is starting up
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-28462
>                 URL: https://issues.apache.org/jira/browse/HBASE-28462
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> We had an incremental backup fail with FileNotFoundException for a file in the
> WALs directory. Upon investigation, the log had been archived a few minutes
> earlier. WALInputFormat's record reader has support for falling back on an
> archived path:
> {code:java}
> } catch (IOException e) {
>   Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
>   // archivedLog can be null if unable to locate in archiveDir.
>   if (archivedLog != null) {
>     openReader(archivedLog);
>     // Try call again in recursion
>     return nextKeyValue();
>   } else {
>     throw e;
>   }
> }
> {code}
> But the getSplits method has different handling:
> {code:java}
> try {
>   List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
>   allFiles.addAll(files);
> } catch (FileNotFoundException e) {
>   if (ignoreMissing) {
>     LOG.warn("File " + inputPath + " is missing. Skipping it.");
>     continue;
>   }
>   throw e;
> }
> {code}
> This ignoreMissing variable was added in HBASE-14141 and is enabled via
> wal.input.ignore.missing.files, which defaults to false and is never set.
> Looking at the comment and reviewboard history of HBASE-14141, I think there
> might have been some confusion about where to handle these missing files, and
> this got lost in the shuffle.
>
> I would prefer not to ignore missing hfiles.
> I think that could result in some weird behavior:
> * RegionServer has 10 archived and 30 not-yet-archived WALs needing to be
> backed up
> * The process starts, and while it's running, 1 of those 30 WALs gets
> archived. That one would get skipped due to FileNotFoundException
> * But the remaining 29 would be backed up
>
> This scenario could cause data consistency issues if this incremental
> backup is restored: we would have missed some edits in the middle of applied
> edits from other WALs.
>
> So I do think failing as we do today is necessary for consistency, but
> unrealistic in a live cluster. The solution is to try finding the missing
> file in the archived directory. Backups have a coprocessor which will not
> allow the archived file to be cleaned up until it's backed up, so I think
> it's safe to say that a WAL is definitely in either WALs or oldWALs.
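
The archive fallback proposed in the issue for getSplits can be sketched roughly as follows. This is a minimal, self-contained illustration and not the HBase change itself: it uses java.nio.file in place of Hadoop's FileSystem, it assumes an archived WAL keeps its file name directly under the archive directory, and the names ArchiveFallbackSketch, resolveWal and resolveAll are hypothetical. A WAL missing from both locations still fails the whole listing, so no edits are silently skipped.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: plain java.nio stand-in for the Hadoop FileSystem calls.
class ArchiveFallbackSketch {

  // Returns the live path if it exists, otherwise the same file name under
  // archiveDir. Throws if the file is in neither place (assumed flat archive layout).
  static Path resolveWal(Path walFile, Path archiveDir) throws FileNotFoundException {
    if (Files.exists(walFile)) {
      return walFile;
    }
    Path archived = archiveDir.resolve(walFile.getFileName());
    if (Files.exists(archived)) {
      return archived;
    }
    throw new FileNotFoundException(walFile + " not found, and not in " + archiveDir);
  }

  // Resolves every WAL or fails as a whole; nothing is skipped.
  static List<Path> resolveAll(List<Path> walFiles, Path archiveDir)
      throws FileNotFoundException {
    List<Path> resolved = new ArrayList<>();
    for (Path wal : walFiles) {
      resolved.add(resolveWal(wal, archiveDir));
    }
    return resolved;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("wal-sketch");
    Path wals = Files.createDirectories(root.resolve("WALs"));
    Path oldWals = Files.createDirectories(root.resolve("oldWALs"));

    Path live = Files.createFile(wals.resolve("wal.1"));
    // wal.2 was listed under WALs but has since been moved to oldWALs.
    Path moved = wals.resolve("wal.2");
    Files.createFile(oldWals.resolve("wal.2"));

    for (Path p : resolveAll(Arrays.asList(live, moved), oldWals)) {
      System.out.println(p);
    }
  }
}
```

Failing only when the file is absent from both directories keeps the consistency guarantee described in the issue while tolerating archival that happens mid-run.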