[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675799#comment-16675799
 ] 

Josh Elser commented on HBASE-21387:
------------------------------------

Ok, cool. Thanks for confirming, Ted.

I can appreciate where your fix is coming from, but I'm still suspect of this 
being a complete fix.

In the current implementation of SnapshotFileCache#getSnapshotsInProgress(..), 
we acquire the lock on the in-progress snapshot before listing the files for 
it. That means the call to getSnapshotsInProgress(..) will block until the 
operation is complete (both for online and offline snapshot generation). So, we 
should never have a case where we read a snapshot's files while it's in the 
process of being written.

However, it does seems like there could be a case where a snapshot we knew to 
be in-progress finishes before the SnapshotFileCleaner "wakes up" (e.g. after 
TakeSnapshotHandler.completeSnapshot(..) is invoked). We may miss this newly 
created snapshot (which was in-progress when we started 
SnapshotFileCache#getUnreferencedFiles but is complete when we finish it).

The bigger problem is that there is still the potential for the 
SnapshotFileCleaner to go to sleep during {{getSnapshotsInProgress}} (in the 
current code or your patch) and miss a newly started snapshot. I do not think 
we can safely identify the files to retain for snapshots without precluding the 
submission of new snapshots. I think your v8 patch improves this situation but 
does not completely solve it.

WDYT?



> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21387
>                 URL: https://issues.apache.org/jira/browse/HBASE-21387
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>              Labels: snapshot
>         Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>       if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to