[https://issues.apache.org/jira/browse/HBASE-28462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848538#comment-17848538]

Nick Dimiduk commented on HBASE-28462:
--------------------------------------

I think the snapshot manager should do something similar as well; see HBASE-19681.

> Incremental backup can fail if log gets archived while WALPlayer is starting up
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-28462
>                 URL: https://issues.apache.org/jira/browse/HBASE-28462
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> We had an incremental backup fail with FileNotFoundException for a file in the 
> WALs directory. Upon investigation, the log had been archived a few minutes 
> earlier. WALInputFormat's record reader supports falling back on the 
> archived path:
> {code:java}
> } catch (IOException e) {
>   Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
>   // archivedLog can be null if unable to locate in archiveDir.
>   if (archivedLog != null) {
>     openReader(archivedLog);
>     // Try call again in recursion
>     return nextKeyValue();
>   } else {
>     throw e;
>   }
> } {code}
> But the getSplits method has different handling:
> {code:java}
> try {
>   List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
>   allFiles.addAll(files);
> } catch (FileNotFoundException e) {
>   if (ignoreMissing) {
>     LOG.warn("File " + inputPath + " is missing. Skipping it.");
>     continue;
>   }
>   throw e;
> } {code}
> This ignoreMissing variable was added in HBASE-14141 and is enabled via 
> wal.input.ignore.missing.files, which defaults to false and is never set. 
> Looking at the comments and Review Board history of HBASE-14141, I think there 
> may have been some confusion about where to handle these missing files, and 
> this got lost in the shuffle.
>  
> I would prefer not to ignore missing WAL files. I think that could result in 
> some weird behavior:
>  * A RegionServer has 10 archived and 30 not-yet-archived WALs that need to be 
> backed up.
>  * The process starts, and while it's running, 1 of those 30 WALs gets 
> archived. That WAL would be skipped due to FileNotFoundException.
>  * But the remaining 29 would be backed up.
> Restoring such an incremental backup could cause data consistency issues: we 
> would be missing some edits in the middle of the edits applied from the other 
> WALs.
> So I do think failing, as we do today, is necessary for consistency, but it is 
> impractical on a live cluster. The solution is to look for the missing file in 
> the archive directory. The backup feature has a coprocessor that will not allow 
> an archived file to be cleaned up until it has been backed up, so I think it's 
> safe to say that a WAL is definitely in either WALs or oldWALs.
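The fallback proposed above (resolve a WAL that has vanished from WALs by looking in oldWALs, and fail only if it is in neither place) can be modeled in plain Java. This is a hedged sketch, not the HBase patch: WalSplitResolver, resolve() and resolveAll() are hypothetical names, java.nio stands in for the Hadoop FileSystem, and each WAL is assumed to sit directly under one of the two directories (real HBase keeps live WALs in per-RegionServer subdirectories and would use AbstractFSWALProvider.findArchivedLog to locate the archived copy).

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical model of a split-time WAL lookup with archive fallback.
 * Not an HBase API: java.nio stands in for the Hadoop FileSystem, and each
 * WAL is assumed to live directly under either walDir or oldWalDir.
 */
class WalSplitResolver {
  private final Path walDir;    // stand-in for the live WALs directory
  private final Path oldWalDir; // stand-in for the oldWALs archive directory

  WalSplitResolver(Path walDir, Path oldWalDir) {
    this.walDir = walDir;
    this.oldWalDir = oldWalDir;
  }

  /** Returns the live path if it exists, otherwise the archived path. */
  Path resolve(String walName) throws FileNotFoundException {
    Path live = walDir.resolve(walName);
    if (Files.exists(live)) {
      return live;
    }
    // The WAL may have been archived after the backup built its file list,
    // so check the archive before giving up.
    Path archived = oldWalDir.resolve(walName);
    if (Files.exists(archived)) {
      return archived;
    }
    // Found in neither place: fail rather than skip, so the backup never
    // silently drops edits from the middle of the WAL sequence.
    throw new FileNotFoundException(
        walName + " not found in " + walDir + " or " + oldWalDir);
  }

  /** Resolves every WAL in order; one unresolvable WAL fails the whole call. */
  List<Path> resolveAll(List<String> walNames) throws FileNotFoundException {
    List<Path> resolved = new ArrayList<>();
    for (String name : walNames) {
      resolved.add(resolve(name));
    }
    return resolved;
  }
}
```

In the 10 + 30 scenario, the WAL archived mid-run would resolve to its oldWALs path, so all 40 WALs are included and there is no gap. A split-time lookup like this does not replace the record reader's fallback: a WAL can still be archived between computing the splits and a task opening the file.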



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
