Xiaolin Ha created HBASE-27043:
----------------------------------

             Summary: Let lock wait timeout to improve performance of 
SnapshotHFileCleaner
                 Key: HBASE-27043
                 URL: https://issues.apache.org/jira/browse/HBASE-27043
             Project: HBase
          Issue Type: Improvement
          Components: snapshots
    Affects Versions: 2.5.0, 3.0.0-alpha-3
            Reporter: Xiaolin Ha
            Assignee: Xiaolin Ha
         Attachments: clearner-before-and-after.png, namenode-callqueue.png

Currently, the HFile cleaner uses directory-scanning threads to find deletable 
files by checking every file under the scanned directories against the cleaner 
chain. Before scanning a directory, the cleaner sorts the subdirectories by 
consumed space, but as we all know, getContentSummary is a time-consuming 
operation on HDFS.

SnapshotHFileCleaner filters out the files unreferenced by any snapshot for 
deletion, and it tries to acquire the write lock of 
SnapshotManager#takingSnapshotLock before determining the deletable files.
[https://github.com/apache/hbase/blob/ad64a9baae2ef8ee56aa3ed6b96cb3d51f5daf0a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java#L195]

But when a snapshot is being taken and the cleaner fails to get the lock, all 
the scanned files are determined to be non-deletable, and all the 
directory-scanning threads scan and call getContentSummary on other directories 
(in which no files are deletable either) one by one until the snapshot-taking 
lock is released.

This behavior is inefficient. We should let the cleaner wait for the lock to 
determine whether the files it currently holds are deletable, instead of 
pointlessly scanning and calling getContentSummary on other/new directories 
while the lock is held by snapshot-taking operations.

I deployed this optimization in our production environment, and the improvement 
is very noticeable.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)