[
https://issues.apache.org/jira/browse/HBASE-29149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042703#comment-18042703
]
David commented on HBASE-29149:
-------------------------------
I took a closer look at the issue and found several code locations that may be
relevant.
The root cause is the lack of retry/archive-lookup logic in the backup client's
WAL-to-HFile conversion ({{convertWALsToHFiles()}} invoked at line 290),
combined with uncoordinated archiving by {{ProcedureWALFile.removeFile}} (lines
164-165).
*Relevant area:*
# Function {{removeFile}} in
{{hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFile.java}}
([lines
160-174|https://github.com/apache/hbase/blob/6d342cc2e0ca0a6f468aad635435cc835bdae7dc/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/ProcedureWALFile.java#L160]):
Lines 164-165 move the WAL out of its original directory, racing with any
readers.
# Function {{execute}} in
{{hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalTableBackupClient.java}}
([lines
306-318|https://github.com/apache/hbase/blob/6d342cc2e0ca0a6f468aad635435cc835bdae7dc/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalTableBackupClient.java#L311]):
Line 311 opens WALs from the original path. If the file has already been
archived, this throws a FileNotFoundException (FNFE), which is caught at lines
314-315 and aborts the backup. No retry or archive fallback logic is present.
*Suggested approach:*
{code:java}
try {
  convertWALsToHFiles();
} catch (FileNotFoundException fnfe) {
  if (backupInfo.getLogArchiveDir() != null) {
    LOG.warn("WAL not found, retrying from archive", fnfe);
    // Proposed new helper (does not exist yet): re-run the conversion
    // against the archived copies of the WALs.
    convertWALsToHFilesUsingArchive(backupInfo.getLogArchiveDir());
  } else {
    throw fnfe;
  }
}{code}
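To make the fallback idea concrete, here is a small self-contained sketch of the open-then-fall-back pattern. It uses plain {{java.nio.file}} instead of the Hadoop {{FileSystem}} API so that it can run on its own, and the class and method names ({{WalArchiveFallback}}, {{openWithArchiveFallback}}) are hypothetical, not existing HBase code. It assumes the archived WAL keeps its file name and lands directly under the archive directory, as the "Archiving ... to ..." log line in the issue description suggests.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Illustrative only: hypothetical helper, not part of HBase. */
final class WalArchiveFallback {
  private WalArchiveFallback() {
  }

  /**
   * Opens walPath; if it no longer exists, opens the file with the same
   * name under archiveDir instead. Attempting the open first (rather than
   * checking existence up front) avoids a check-then-open race with a
   * concurrent archiver.
   */
  static InputStream openWithArchiveFallback(Path walPath, Path archiveDir)
      throws IOException {
    try {
      return Files.newInputStream(walPath);
    } catch (NoSuchFileException missing) {
      Path fileName = walPath.getFileName();
      if (fileName == null) {
        // No file name to look up in the archive, so surface the original error.
        throw missing;
      }
      // Assumption: the archiver moved the file without renaming it.
      // If it is not in the archive either, this call throws as well.
      return Files.newInputStream(archiveDir.resolve(fileName));
    }
  }
}
```

In the real fix the same shape would presumably apply per WAL file, with the archive location taken from the cluster's oldWALs directory, so that a single archived file does not abort the whole incremental backup.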
Happy to adjust this if I missed anything.
> WAL files can be archived during incremental backup process
> -----------------------------------------------------------
>
> Key: HBASE-29149
> URL: https://issues.apache.org/jira/browse/HBASE-29149
> Project: HBase
> Issue Type: Bug
> Reporter: Hernan Gelaf-Romer
> Assignee: Hernan Gelaf-Romer
> Priority: Major
>
> At my job, we've run into FNFE issues when WAL files are archived as they are
> being loaded to be converted into HFiles. When looking at the failure logs,
> we can see that the WAL was loaded just after the archive had occurred
> server-side.
>
> {quote}2025-02-24 17:10:34.333 [pool-124-thread-1] ERROR
> o.a.h.h.b.impl.TableBackupClient - Unexpected exception in
> incremental-backup: incremental copy backup_1740417014671File
> hdfs://nestor-hb2-a-qa:8020/hbase/WALs/na1-purple-dizzy-antelope.iad03.hubinternal.net,60020,1739996267893/na1-purple-dizzy-antelope.iad03.hubinternal.net%2C60020%2C1739996267893.1740412909549
> does not exist.
> java.io.FileNotFoundException: File
> hdfs://nestor-hb2-a-qa:8020/hbase/WALs/na1-purple-dizzy-antelope.iad03.hubinternal.net,60020,1739996267893/na1-purple-dizzy-antelope.iad03.hubinternal.net%2C60020%2C1739996267893.1740412909549
> does not exist.
> {quote}
>
> {quote}2025-02-24 17:10:17.787 Archiving
> hdfs://nestor-hb2-a-qa:8020/hbase/WALs/na1-purple-dizzy-antelope.iad03.hubinternal.net,60020,1739996267893/na1-purple-dizzy-antelope.iad03.hubinternal.net%2C60020%2C1739996267893.1740412909549
> to
> hdfs://nestor-hb2-a-qa:8020/hbase/oldWALs/na1-purple-dizzy-antelope.iad03.hubinternal.net%2C60020%2C1739996267893.1740412909549
> {quote}
>
> We already handle a similar situation when loading bulkloads, and add a
> re-try mechanism that checks the archive directory. We should probably do a
> similar thing here