[
https://issues.apache.org/jira/browse/HBASE-28462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Beaudreault updated HBASE-28462:
--------------------------------------
Description:
We had an incremental backup fail with a FileNotFoundException for a file in
the WALs directory. Upon investigation, the log had been archived a few minutes
earlier. WALInputFormat's record reader supports falling back to an archived
path:
{code:java}
} catch (IOException e) {
  Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
  // archivedLog can be null if unable to locate in archiveDir.
  if (archivedLog != null) {
    openReader(archivedLog);
    // Try call again in recursion
    return nextKeyValue();
  } else {
    throw e;
  }
}
{code}
But the getSplits method has different handling:
{code:java}
try {
  List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
  allFiles.addAll(files);
} catch (FileNotFoundException e) {
  if (ignoreMissing) {
    LOG.warn("File " + inputPath + " is missing. Skipping it.");
    continue;
  }
  throw e;
}
{code}
This ignoreMissing variable was added in HBASE-14141 and is enabled via
wal.input.ignore.missing.files, which defaults to false and is never set.
Looking at the comment and reviewboard history of HBASE-14141, I think there
might have been some confusion about where to handle these missing files, and
this got lost in the shuffle.
I would prefer not to ignore missing WAL files. I think that could result in
some weird behavior:
* RegionServer has 10 archived and 30 not-yet-archived WALs needing to be
backed up
* The process starts, and while it's running 1 of those 30 WALs gets archived.
That WAL would be skipped due to the FileNotFoundException
* But the remaining 29 would be backed up
This scenario could cause data consistency issues if this incremental backup
is restored: we would be missing some edits in the middle of the applied edits
from other WALs.
So I do think failing as we do today is necessary for consistency, but it is
unrealistic in a live cluster. The solution is to try finding the missing file
in the archive directory. The backup system has a coprocessor which will not
allow the archived file to be cleaned up until it's backed up, so I think it's
safe to say that a WAL is definitely in either WALs or oldWALs.
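As a rough illustration of the proposed fallback, here is a minimal sketch that models the lookup with plain local files instead of HDFS. It is not the HBase implementation: the class and method names (ArchivedWalFallback, resolveWal, resolveAll) are hypothetical, and it assumes an archived WAL keeps its file name and sits directly under the archive directory. The point it shows is that a missing file is looked up in the archive before failing, and that the job still fails, rather than skipping the file, if the WAL is in neither place.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Simplified model only: local files stand in for HDFS, and every name in
// this class is hypothetical (none of it is HBase API).
public class ArchivedWalFallback {

  // Returns the live path if it still exists, otherwise the file with the
  // same name under archiveDir. Throws if the WAL is in neither location,
  // so a truly missing WAL still fails the job instead of being skipped.
  static Path resolveWal(Path walPath, Path archiveDir) throws FileNotFoundException {
    if (Files.exists(walPath)) {
      return walPath;
    }
    // Assumption: archiving keeps the file name and uses a flat directory.
    Path archived = archiveDir.resolve(walPath.getFileName());
    if (Files.exists(archived)) {
      return archived;
    }
    throw new FileNotFoundException(walPath + " not found, and not in " + archiveDir);
  }

  // Resolves every requested WAL, preserving the input order.
  static List<Path> resolveAll(List<Path> walPaths, Path archiveDir)
      throws FileNotFoundException {
    List<Path> resolved = new ArrayList<>();
    for (Path wal : walPaths) {
      resolved.add(resolveWal(wal, archiveDir));
    }
    return resolved;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("wal-demo");
    Path wals = Files.createDirectory(root.resolve("WALs"));
    Path oldWals = Files.createDirectory(root.resolve("oldWALs"));

    // wal.1 is still live; wal.2 was archived after the job listed it.
    Path live = Files.createFile(wals.resolve("wal.1"));
    Path moved = wals.resolve("wal.2");
    Files.createFile(oldWals.resolve("wal.2"));

    // Prints the live path for wal.1 and the archived path for wal.2.
    System.out.println(resolveAll(List.of(live, moved), oldWals));
  }
}
```

In the real code the equivalent lookup would presumably go through AbstractFSWALProvider.findArchivedLog, as the record reader already does, rather than a hand-rolled path resolution.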
> Incremental backup can fail if log gets archived while WALPlayer is starting
> up
> -------------------------------------------------------------------------------
>
> Key: HBASE-28462
> URL: https://issues.apache.org/jira/browse/HBASE-28462
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Bryan Beaudreault
> Priority: Major
>