[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671991#comment-16671991
 ] 

Josh Elser commented on HBASE-21387:
------------------------------------

{quote}
The root cause is race condition surrounding SnapshotFileCache#refreshCache().
 There are two callers of refreshCache: ... and the other from 
SnapshotHFileCleaner.
{quote}

I don't see any call to refreshCache() by SnapshotHFileCleaner (only a call to 
{{getUnreferencedFiles()}}). {{getUnreferencedFiles()}} calls 
{{refreshCache()}} -- is that what you mean?

{quote}
Suppose the RefreshCacheTask runs past the if check and sets 
this.lastModifiedTime
The cleaner executes refreshCache and returns immediately since 
this.lastModifiedTime matches the modification time of the directory.
Now RefreshCacheTask clears the cache. By the time the cleaner performs cache 
lookup, the cache is empty.
{quote}

This doesn't appear to be possible to me because both 
{{getUnreferencedFiles()}} and {{refreshCache()}} are marked {{synchronized}}. 
This means that it should be impossible to have two callers inside either of 
these methods at the same time. Am I missing something?

Given the above and looking at the patch, that would mean that change which 
actually makes an effect is:
{code}
+      // 5. update the modified time
+      this.lastModifiedTime = dirStatus.getModificationTime();
{code}

However, it's not immediately obvious to me why this is necessary or why it 
makes your manual test pass.

> Race condition in snapshot cache refreshing leads to loss of snapshot files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-21387
>                 URL: https://issues.apache.org/jira/browse/HBASE-21387
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>              Labels: snapshot
>         Attachments: 21387.v1.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding SnapshotFileCache#refreshCache().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>     // if the snapshot directory wasn't modified since we last check, we are 
> done
>     if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;
>     // 1. update the modified time
>     this.lastModifiedTime = dirStatus.getModificationTime();
>     // 2.clear the cache
>     this.cache.clear();
> {code}
> Suppose the RefreshCacheTask runs past the if check and sets 
> this.lastModifiedTime
> The cleaner executes refreshCache and returns immediately since 
> this.lastModifiedTime matches the modification time of the directory.
> Now RefreshCacheTask clears the cache. By the time the cleaner performs cache 
> lookup, the cache is empty.
> Therefore cleaner puts the file into unReferencedFiles - leading to data loss.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to