Ted Yu created HBASE-21387:
------------------------------
Summary: Race condition in snapshot cache refreshing leads to loss
of snapshot files
Key: HBASE-21387
URL: https://issues.apache.org/jira/browse/HBASE-21387
Project: HBase
Issue Type: Bug
Reporter: Ted Yu
Assignee: Ted Yu
In a recent customer report, ExportSnapshot failed with the following error:
{code}
2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2]
snapshot.SnapshotReferenceUtil: Can't find hfile:
44f6c3c646e84de6a63fe30da4fcb3aa in the real
(hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
or archive
(hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
directory for the primary table.
{code}
We found the following in the log, written nine seconds before the failure:
{code}
2018-10-09 18:54:23,675 DEBUG
[00:16000.activeMasterManager-HFileCleaner.large-1539035367427]
cleaner.HFileCleaner: Removing:
hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa
from archive
{code}
The root cause is a race condition in SnapshotFileCache#refreshCache().
There are two callers of refreshCache: one from RefreshCacheTask#run and the
other from SnapshotHFileCleaner.
Let's look at the code of refreshCache:
{code}
// if the snapshot directory wasn't modified since we last check, we are done
if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;
// 1. update the modified time
this.lastModifiedTime = dirStatus.getModificationTime();
// 2.clear the cache
this.cache.clear();
{code}
Suppose RefreshCacheTask runs past the if check and sets
this.lastModifiedTime, but has not yet cleared the cache.
The cleaner then executes refreshCache and returns immediately, since
this.lastModifiedTime already matches the modification time of the directory.
Now RefreshCacheTask clears the cache. By the time the cleaner performs its
cache lookup, the cache is empty.
The cleaner therefore puts the file into unReferencedFiles, and the file is
deleted from the archive, leading to data loss.
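One possible way to close this window is to do the timestamp update, the
clear, and the repopulation under the same lock that guards the lookup. The
sketch below is only an illustration of that idea, not the HBase code and not
necessarily the fix that will be committed: SnapshotCacheSketch, refresh and
isReferenced are hypothetical names, and the directory scan is replaced by a
collection passed in by the caller.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for SnapshotFileCache; not the HBase class. */
class SnapshotCacheSketch {
  private final Set<String> cache = new HashSet<>();
  private long lastModifiedTime = Long.MIN_VALUE;

  /**
   * Refreshes the cache if the directory changed. The timestamp update, the
   * clear, and the repopulation all happen while holding this object's monitor,
   * so no other synchronized caller can run between those steps.
   */
  synchronized void refresh(long dirModificationTime, Collection<String> referencedFiles) {
    // directory not modified since the last refresh: nothing to do
    if (dirModificationTime <= lastModifiedTime) {
      return;
    }
    // 1. update the modified time
    lastModifiedTime = dirModificationTime;
    // 2. clear the cache
    cache.clear();
    // 3. repopulate (assumption: the real code rescans the snapshot directory here)
    cache.addAll(referencedFiles);
  }

  /** Lookup under the same monitor, so it never sees a cleared but not yet refilled cache. */
  synchronized boolean isReferenced(String fileName) {
    return cache.contains(fileName);
  }
}
```

With this shape, a cleaner thread calling refresh followed by isReferenced
blocks until a concurrent refresh has finished repopulating, instead of
returning early and reading an empty cache. The cost is that lookups wait for
the duration of a refresh.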
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)