[ 
https://issues.apache.org/jira/browse/HBASE-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036525#comment-14036525
 ] 

churro morales commented on HBASE-11322:
----------------------------------------

Hi Lars, 

The fix is correct, but it revealed the HBASE-11360 bug.

Suppose the following happens:
1. We take a snapshot
2. It creates the /hbase/.hbase-snapshot/.tmp/<someSnapshot> directory and 
starts populating it with files
3. A cleaner run happens; the cache refresh notices the .tmp directory has 
changed and sets lastModified to its timestamp
4. A region server finishes its part of the snapshot and, right after that, 
also finishes compacting some store files
5. The next time the cleaner runs it does not refresh the cache, since the 
timestamp of the .tmp directory has not changed, even though files have been 
added since the last run (see the sketch after this list)
6. The cleaner deletes the archived HFile in question and the snapshot fails.
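
To make step 5 concrete, here is an illustrative sketch (not the actual 
SnapshotFileCache code; class and field names are made up) of the mtime-gated 
refresh that the scenario above races against:

{code}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;

// Illustrative only: the refresh is gated purely on the snapshot and .tmp
// directory mtimes, so an HFile a snapshot started referencing after the last
// refresh never makes it into the cache and looks deletable (step 6).
public class MtimeGatedCacheSketch {
  private long lastModifiedTime = -1;
  private final Set<String> snapshotFiles = new HashSet<String>();

  void refreshIfNeeded(FileStatus dirStatus, FileStatus tempStatus) {
    // Step 5: neither directory mtime moved, so the refresh is skipped even
    // though new files were added underneath since the last run.
    if (dirStatus.getModificationTime() <= lastModifiedTime
        && tempStatus.getModificationTime() <= lastModifiedTime) {
      return;
    }
    lastModifiedTime = Math.max(dirStatus.getModificationTime(),
        tempStatus.getModificationTime());
    // ... re-list /hbase/.hbase-snapshot and .tmp into snapshotFiles ...
  }

  boolean isSafeToDelete(String hfileName) {
    // Step 6: with a stale cache this answers "yes" for an HFile the
    // in-flight snapshot still needs.
    return !snapshotFiles.contains(hfileName);
  }
}
{code}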

With the old code this probably went largely unnoticed: since the snapshot and 
.tmp directories had different timestamps and the lastModified logic was 
incorrect, a cache refresh happened on every run anyway.
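
As a quick worked check of that, using timestamps matching the 6-1/6-2 example 
in the description below (illustrative only, not code from the patch):

{code}
public class MinVsMaxCheck {
  public static void main(String[] args) {
    long dirMtime = 1401580800000L;   // 2014-06-01 UTC
    long tempMtime = 1401667200000L;  // 2014-06-02 UTC

    // Old code: Math.min records 6-1, so tempMtime always looks "newer" and
    // the early-exit check never fires -> a full cache refresh on every run.
    long lastModifiedMin = Math.min(dirMtime, tempMtime);
    System.out.println(dirMtime <= lastModifiedMin
        && tempMtime <= lastModifiedMin);  // false

    // Fixed code: Math.max records 6-2, so the check exits early until one of
    // the directories is actually modified again.
    long lastModifiedMax = Math.max(dirMtime, tempMtime);
    System.out.println(dirMtime <= lastModifiedMax
        && tempMtime <= lastModifiedMax);  // true
  }
}
{code}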

We have been running with the new cleaner logic for about a week and it's fine 
(it's no longer slow). To get around this issue, after snapshotting we touched 
a file in the .tmp directory and then removed it, to force the cache refresh.
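
For anyone hitting the same thing, a minimal sketch of that workaround using 
the Hadoop FileSystem API (the marker file name is just for illustration):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Create and immediately delete a marker file in the snapshot .tmp directory
// so its mtime changes and the next cleaner run is forced to refresh its cache.
public class ForceSnapshotCacheRefresh {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path marker = new Path("/hbase/.hbase-snapshot/.tmp/.force-cache-refresh");
    fs.createNewFile(marker);  // bumps the .tmp directory's modification time
    fs.delete(marker, false);  // bumps it again and leaves nothing behind
  }
}
{code}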

> SnapshotHFileCleaner makes the wrong check for lastModified time thus causing 
> too many cache refreshes
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-11322
>                 URL: https://issues.apache.org/jira/browse/HBASE-11322
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.19
>            Reporter: churro morales
>            Assignee: churro morales
>            Priority: Critical
>             Fix For: 0.94.21
>
>         Attachments: 11322.94.txt, HBASE-11322.patch
>
>
> The SnapshotHFileCleaner asks the SnapshotFileCache whether a particular 
> HFile is part of a snapshot.
> If the HFile is not in the cache, we refresh the cache and check again.
> The refresh checks whether anything has been modified since the last cache 
> refresh, but this logic is incorrect in certain scenarios.
> The last modified time is computed via this operation:
> {code}
> this.lastModifiedTime = Math.min(dirStatus.getModificationTime(),
>                                      tempStatus.getModificationTime());
> {code}
> and the check to see if the snapshot directories have been modified:
> {code}
> // if the snapshot directory wasn't modified since we last check, we are done
>     if (dirStatus.getModificationTime() <= lastModifiedTime &&
>         tempStatus.getModificationTime() <= lastModifiedTime) {
>       return;
>     }
> {code}
> Suppose the following happens:
> dirStatus modified 6-1-2014
> tempStatus modified 6-2-2014
> lastModifiedTime = 6-1-2014
> Provided these two directories don't get modified again, all subsequent 
> checks won't exit early like they should.
> In our cluster this was a huge performance hit.  The cleaner chain fell 
> behind, almost filling up DFS and our NameNode heap.
> It's a simple fix: instead of Math.min we use Math.max for lastModified; I 
> believe that will be correct.
> I'll apply a patch for you guys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
