[jira] [Updated] (HBASE-28462) Incremental backup can fail if log gets archived while WALPlayer is starting up

Bryan Beaudreault (Jira) Wed, 27 Mar 2024 14:04:04 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-28462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bryan Beaudreault updated HBASE-28462:
--------------------------------------
    Description: 
We had an incremental backup fail with a FileNotFoundException for a file in
the WALs directory. Upon investigation, the log had been archived a few
minutes earlier. WALInputFormat's record reader has support for falling back
on an archived path:
{code:java}
} catch (IOException e) {
  Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
  // archivedLog can be null if unable to locate in archiveDir.
  if (archivedLog != null) {
    openReader(archivedLog);
    // Try call again in recursion
    return nextKeyValue();
  } else {
    throw e;
  }
} {code}
But the getSplits method has different handling:
{code:java}
try {
  List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
  allFiles.addAll(files);
} catch (FileNotFoundException e) {
  if (ignoreMissing) {
    LOG.warn("File " + inputPath + " is missing. Skipping it.");
    continue;
  }
  throw e;
} {code}
The ignoreMissing variable was added in HBASE-14141 and is enabled via
wal.input.ignore.missing.files, which defaults to false and is never set.
Looking at the comment and reviewboard history of HBASE-14141, I think there
might have been some confusion about where to handle these missing files, and
this handling got lost in the shuffle.
 
I would prefer not to ignore missing WAL files. I think that could result in
some weird behavior:
 * A RegionServer has 10 archived and 30 not-yet-archived WALs needing to be
backed up
 * The process starts, and while it's running, 1 of those 30 WALs gets
archived. That WAL would get skipped due to the FileNotFoundException
 * But the remaining 29 would be backed up

This scenario could cause data consistency issues if this incremental backup
is restored: we would be missing some edits in the middle of applied edits
from other WALs.
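
To make the gap concrete, here is a minimal sketch under stated assumptions. It
is a toy model, not HBase code: each "WAL" is just an ordered list of edits,
and the names SkippedWalGapSketch, Edit, and replay are hypothetical. It
replays three WALs twice, once in full and once with the middle WAL skipped,
the way ignoreMissing would skip it.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model, not HBase's WAL format or API.
final class SkippedWalGapSketch {

  /** One edit: a put of value to row, or a delete of row when value is null. */
  static final class Edit {
    final String row;
    final String value;

    Edit(String row, String value) {
      this.row = row;
      this.value = value;
    }
  }

  /** Replays the WALs in order, skipping any WAL whose index is in skipped. */
  static Map<String, String> replay(List<List<Edit>> wals, Set<Integer> skipped) {
    Map<String, String> table = new TreeMap<>();
    for (int i = 0; i < wals.size(); i++) {
      if (skipped.contains(i)) {
        continue; // models ignoreMissing = true for a WAL that moved
      }
      for (Edit e : wals.get(i)) {
        if (e.value == null) {
          table.remove(e.row);
        } else {
          table.put(e.row, e.value);
        }
      }
    }
    return table;
  }

  public static void main(String[] args) {
    List<List<Edit>> wals = List.of(
        List.of(new Edit("a", "1")),
        List.of(new Edit("a", null), new Edit("b", "2")), // archived mid-run
        List.of(new Edit("c", "3")));
    System.out.println("all WALs replayed:  " + replay(wals, Set.of()));
    System.out.println("middle WAL skipped: " + replay(wals, Set.of(1)));
  }
}
```

In this model the full replay ends with rows b and c, while the replay that
skips the middle WAL ends with rows a and c, a state the table never actually
had.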

So I do think failing as we do today is necessary for consistency, but it is
unrealistic in a live cluster. The solution is to try finding the missing file
in the archived directory. The backup feature has a coprocessor that will not
allow the archived file to be cleaned up until it's backed up, so I think it's
safe to say that a WAL is definitely in either WALs or oldWALs.
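
As a rough sketch of that fallback, and not the actual patch: the example below
uses plain java.nio.file in place of Hadoop's FileSystem, and a simple
same-file-name lookup in place of AbstractFSWALProvider.findArchivedLog. The
class and method names (WalSplitFallbackSketch, resolveWal, resolveAll) and the
flat archive layout are assumptions made for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: java.nio stands in for Hadoop's FileSystem, and the
// archive lookup assumes the file keeps its name directly under archiveDir.
final class WalSplitFallbackSketch {

  /** Returns walPath if it is still there, else its archived copy, else rethrows. */
  static Path resolveWal(Path walPath, Path archiveDir) throws IOException {
    try {
      // Stands in for listing the WAL while building the splits.
      Files.readAttributes(walPath, BasicFileAttributes.class);
      return walPath;
    } catch (NoSuchFileException e) {
      Path archived = archiveDir.resolve(walPath.getFileName());
      if (Files.exists(archived)) {
        return archived;
      }
      // In neither location: fail rather than silently skip edits.
      throw e;
    }
  }

  /** Resolves every input path, failing on the first one found in neither place. */
  static List<Path> resolveAll(List<Path> walPaths, Path archiveDir) throws IOException {
    List<Path> resolved = new ArrayList<>();
    for (Path p : walPaths) {
      resolved.add(resolveWal(p, archiveDir));
    }
    return resolved;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("wal-sketch");
    Path wals = Files.createDirectory(root.resolve("WALs"));
    Path oldWals = Files.createDirectory(root.resolve("oldWALs"));
    Path a = Files.createFile(wals.resolve("wal.1"));
    Path b = Files.createFile(wals.resolve("wal.2"));
    // Simulate the race: wal.2 is archived after the input list was built.
    Files.move(b, oldWals.resolve(b.getFileName()));
    System.out.println(resolveAll(List.of(a, b), oldWals));
  }
}
```

The point of the design is that a missing file is neither ignored nor
immediately fatal: the job only fails when the WAL is in neither directory,
which the cleanup guarantee described above should rule out.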


> Incremental backup can fail if log gets archived while WALPlayer is starting 
> up
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-28462
>                 URL: https://issues.apache.org/jira/browse/HBASE-28462
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Bryan Beaudreault
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
