[ https://issues.apache.org/jira/browse/HBASE-27043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaolin Ha updated HBASE-27043:
-------------------------------
Fix Version/s: 2.5.0, 3.0.0-alpha-3
> Let lock wait timeout to improve performance of SnapshotHFileCleaner
> --------------------------------------------------------------------
>
> Key: HBASE-27043
> URL: https://issues.apache.org/jira/browse/HBASE-27043
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 2.5.0, 3.0.0-alpha-3
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
> Attachments: clearner-before-and-after.png, namenode-callqueue.png
>
>
> Currently, the HFile cleaner uses directory-scanning threads to find
> deletable files: every file under a scanned directory is checked against
> the cleaner chain. Before scanning a directory, the cleaner sorts its
> subdirectories by consumed space, which relies on getContentSummary, a
> time-consuming operation on HDFS.
> SnapshotHFileCleaner marks as deletable only the files that are not
> referenced by any snapshot, and it tries to acquire the write lock of
> SnapshotManager#takingSnapshotLock before deciding which files are deletable.
> [https://github.com/apache/hbase/blob/ad64a9baae2ef8ee56aa3ed6b96cb3d51f5daf0a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java#L195]
> However, if a snapshot is being taken and the cleaner fails to acquire the
> lock, all of the scanned files are treated as non-deletable. The
> directory-scanning threads then move on to scan, and call getContentSummary
> on, the other directories one by one (where no files are deletable either),
> until the snapshot lock is released.
> This is inefficient. Rather than pointlessly scanning and calling
> getContentSummary on other or new directories while a snapshot operation
> holds the lock, the cleaner should wait for the lock, up to a timeout, and
> then determine whether the files it currently holds are deletable.
> I deployed this optimization in our production environment, and the
> improvement was clearly visible.
>
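The directory ordering described in the issue, where subdirectories are sorted by consumed space before they are scanned, can be sketched in plain Java as below. This is an illustration only, not HBase's cleaner code: the consumedSpace helper, the fake size map and the /archive/t* paths are hypothetical stand-ins for an HDFS call such as FileSystem#getContentSummary(Path), and the largest-first order is an assumption. What it shows is that every subdirectory costs one expensive size lookup before any of its files is checked.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SortByConsumedSpaceSketch {

  // Counts how often the (simulated) expensive size call is made.
  static final AtomicInteger SUMMARY_CALLS = new AtomicInteger();

  // Hypothetical stand-in for an expensive HDFS call such as
  // FileSystem#getContentSummary(path); here it only looks up a fake size.
  static long consumedSpace(Map<String, Long> fakeSizes, String dir) {
    SUMMARY_CALLS.incrementAndGet();
    return fakeSizes.getOrDefault(dir, 0L);
  }

  // Sorts subdirectories by consumed space, largest first (assumed order).
  // Each size is computed once up front so the expensive call is not
  // repeated inside the comparator.
  static List<String> sortByConsumedSpace(List<String> subdirs, Map<String, Long> fakeSizes) {
    Map<String, Long> sizes = new HashMap<>();
    for (String dir : subdirs) {
      sizes.put(dir, consumedSpace(fakeSizes, dir));
    }
    List<String> sorted = new ArrayList<>(subdirs);
    sorted.sort(Comparator.comparingLong((String d) -> sizes.get(d)).reversed());
    return sorted;
  }

  public static void main(String[] args) {
    Map<String, Long> fakeSizes =
        Map.of("/archive/t1", 10L, "/archive/t2", 300L, "/archive/t3", 42L);
    List<String> order =
        sortByConsumedSpace(List.of("/archive/t1", "/archive/t2", "/archive/t3"), fakeSizes);
    System.out.println(order);
    System.out.println("size lookups: " + SUMMARY_CALLS.get());
  }
}
```

Caching the sizes keeps the cost at one lookup per subdirectory rather than one per comparison, but that cost is still paid for every directory the scanning threads visit, including directories whose files cannot be deleted while the snapshot lock is held.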
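The lock handling discussed in the issue can be illustrated with a JDK ReentrantReadWriteLock standing in for SnapshotManager#takingSnapshotLock. For the simulation it is assumed that a snapshot holds the read side while the cleaner needs the write side; the class and method names, the waitMillis parameter and the ".ref" rule in filterUnreferenced are hypothetical. deletableNoWait mirrors the non-blocking tryLock() behavior the issue describes, and deletableWithWait sketches the proposed change of waiting for the lock with a timeout via Lock#tryLock(long, TimeUnit).

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;

public class SnapshotLockWaitSketch {

  // Stand-in for SnapshotManager#takingSnapshotLock. In this simulation a
  // snapshot holds the read lock and the cleaner needs the write lock.
  static final ReentrantReadWriteLock takingSnapshotLock = new ReentrantReadWriteLock();

  // Behavior described in the issue: a single non-blocking tryLock(). If a
  // snapshot holds the lock, nothing is reported as deletable and the caller
  // moves on to the next directory.
  static List<String> deletableNoWait(List<String> candidates) {
    Lock lock = takingSnapshotLock.writeLock();
    if (lock.tryLock()) {
      try {
        return filterUnreferenced(candidates);
      } finally {
        lock.unlock();
      }
    }
    return List.of();
  }

  // Proposed behavior: wait up to waitMillis for the lock before giving up,
  // so the files already collected can be evaluated once the snapshot ends.
  static List<String> deletableWithWait(List<String> candidates, long waitMillis)
      throws InterruptedException {
    Lock lock = takingSnapshotLock.writeLock();
    if (lock.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
      try {
        return filterUnreferenced(candidates);
      } finally {
        lock.unlock();
      }
    }
    return List.of();
  }

  // Hypothetical placeholder for the snapshot reference check: files ending
  // in ".ref" are treated as referenced by a snapshot, all others are not.
  static List<String> filterUnreferenced(List<String> candidates) {
    return candidates.stream()
        .filter(f -> !f.endsWith(".ref"))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws InterruptedException {
    List<String> files = List.of("a.hfile", "b.ref", "c.hfile");
    CountDownLatch held = new CountDownLatch(1);

    // Simulated snapshot: holds the read lock for about 200 ms, then releases it.
    Thread snapshot = new Thread(() -> {
      takingSnapshotLock.readLock().lock();
      try {
        held.countDown();
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        takingSnapshotLock.readLock().unlock();
      }
    });
    snapshot.start();
    held.await(); // ensure the snapshot holds the lock before the cleaner runs

    System.out.println("no wait:   " + deletableNoWait(files));
    System.out.println("with wait: " + deletableWithWait(files, 5_000));
    snapshot.join();
  }
}
```

With the timed variant, files the cleaner has already collected can still be evaluated once the snapshot finishes instead of being skipped as non-deletable, while the timeout bounds how long a cleaner thread can block if a snapshot runs for a long time.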
--
This message was sent by Atlassian Jira
(v8.20.7#820007)