[
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell updated HBASE-21387:
-----------------------------------
Fix Version/s: (was: 1.5.0)
> Race condition surrounding in progress snapshot handling in snapshot cache
> leads to loss of snapshot files
> ----------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Fix For: 3.0.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt,
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt,
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt,
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch,
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch,
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch,
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2]
> snapshot.SnapshotReferenceUtil: Can't find hfile:
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
> or archive
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
> directory for the primary table.
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427]
> cleaner.HFileCleaner: Removing:
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that
> lastModifiedTime is up to date. So cleaner proceeds to check in progress
> snapshot(s). However, the snapshot has completed by that time, resulting in
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a
> snapshot S1 in progress that is referencing a file F1. refreshCache() is
> called, but no completed snapshot references F1. At T2, the snapshot S1,
> which references F1, completes. At T3, we check in-progress snapshots and S1
> is not included. Thus, F1 is marked as unreferenced even though S1 references
> it.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)