[
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386372#comment-15386372
]
Ben Lau commented on HBASE-16052:
---------------------------------
Looks like some of the methods don't exist on Hadoop 1.1; my local test runs
were against Hadoop 2.x. Will fix.
> Improve HBaseFsck Scalability
> -----------------------------
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
> Issue Type: Improvement
> Components: hbck
> Reporter: Ben Lau
> Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0, 0.98.21
>
> Attachments: HBASE-16052-0.98.v3.patch, HBASE-16052-master.patch,
> HBASE-16052-v3-0.98.patch, HBASE-16052-v3-branch-1.patch,
> HBASE-16052-v3-master.patch
>
>
> HBaseFsck has several problems that make it unnecessarily slow, especially
> for large tables or clusters with many regions.
> This patch tries to fix the biggest bottlenecks and also includes a couple of
> bug fixes for race conditions caused by gathering state about a live cluster
> and holding it until it is no longer true by the time Fsck processing uses
> it. These race conditions cause Fsck to crash and become unusable on large
> clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and
> then discarding everything but the Paths, then passing the Paths to a
> PathFilter, and then having the filter look up the (previously discarded)
> FileStatuses of the paths again. This is actually worse than double I/O
> because the first lookup obtains a batch of FileStatuses while all the other
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen
> directly on FileStatuses
> -- This performance bug affects not only Fsck but also, to some extent,
> things like snapshots, hfile archival, etc. I didn't have time to look too
> deeply into the other things affected and didn't want to increase the scope
> of this ticket, so I focus mostly on Fsck and make only a few improvements to
> other code paths. The changes in this patch should, though, make it fairly
> easy to fix other code paths in later jiras if we feel other features are
> strongly impacted by this problem.
> - offlineReferenceFileRepair() is the most expensive part of Fsck (often 50%
> of Fsck runtime) and its running time scales with the number of store files,
> yet the function is completely serial
> -- Make offlineReferenceFileRepair() multithreaded
> - loadHdfsRegionDirs() uses table-level concurrency, which is a big
> bottleneck if you have one large cluster with one very large table that has
> nearly all the regions
> -- Change loadHdfsRegionDirs() to use region-level parallelism instead of
> table-level parallelism.
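
To illustrate the first item in the list above, here is a minimal sketch of
filtering on paths versus filtering on statuses. It is not code from the
patch: Status, FakeFs, PathOnlyFilter and StatusFilter are hypothetical
stand-ins (not Hadoop or HBase classes), and the real FileStatusFilter
interface may look different. The fake filesystem only counts simulated RPCs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class FileStatusFilterSketch {
    // Hypothetical stand-in for a file status: a path plus an "is directory" flag.
    static final class Status {
        final String path;
        final boolean dir;
        Status(String path, boolean dir) { this.path = path; this.dir = dir; }
    }

    // Sees only the path, so any metadata it needs must be fetched again.
    interface PathOnlyFilter { boolean accept(String path); }

    // Sees the status object that the batch listing already returned.
    interface StatusFilter { boolean accept(Status status); }

    // Hypothetical stand-in for a remote filesystem; counts simulated RPCs.
    static final class FakeFs {
        final AtomicInteger rpcs = new AtomicInteger();
        private final List<Status> entries;
        FakeFs(List<Status> entries) { this.entries = entries; }

        // One RPC returns the statuses of all entries as a batch.
        List<Status> listStatus() {
            rpcs.incrementAndGet();
            return new ArrayList<>(entries);
        }

        // One RPC per individual lookup.
        Status getFileStatus(String path) {
            rpcs.incrementAndGet();
            for (Status s : entries) {
                if (s.path.equals(path)) return s;
            }
            throw new IllegalArgumentException("no such path: " + path);
        }
    }

    // Old pattern: list, keep only the paths, and let the filter look each one up again.
    static List<String> listDirsViaPaths(FakeFs fs) {
        PathOnlyFilter dirsOnly = p -> fs.getFileStatus(p).dir;
        List<String> out = new ArrayList<>();
        for (Status s : fs.listStatus()) {
            if (dirsOnly.accept(s.path)) out.add(s.path);
        }
        return out;
    }

    // New pattern: filter directly on the statuses from the batch listing.
    static List<String> listDirsViaStatuses(FakeFs fs) {
        StatusFilter dirsOnly = s -> s.dir;
        List<String> out = new ArrayList<>();
        for (Status s : fs.listStatus()) {
            if (dirsOnly.accept(s)) out.add(s.path);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up entries, for illustration only.
        List<Status> entries = Arrays.asList(
            new Status("/data/regionA", true),
            new Status("/data/regionB", true),
            new Status("/data/notes.txt", false));
        FakeFs viaPaths = new FakeFs(entries);
        FakeFs viaStatuses = new FakeFs(entries);
        System.out.println(listDirsViaPaths(viaPaths) + " rpcs=" + viaPaths.rpcs.get());
        System.out.println(listDirsViaStatuses(viaStatuses) + " rpcs=" + viaStatuses.rpcs.get());
    }
}
```

With the three made-up entries in main, the path-based pattern issues four
simulated RPCs (one listing plus one lookup per entry), while the
status-based pattern issues only the listing. Both return the same
directories.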
> The changes benefit all clusters but are especially noticeable for large
> clusters with a few very large tables. On our version of 0.98 with the
> original patch, HBaseFsck on a moderately sized production cluster with 2
> (user) tables and ~160k regions went from 18 minutes to 5 minutes.
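
As a rough sketch of why region-level parallelism helps when one table holds
nearly all the regions, the following compares one task per table with one
task per region. It is not code from the patch: loadRegionDir, the task
builders and runAll are hypothetical names, and the per-region work here is a
placeholder string operation rather than real HDFS access.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RegionLevelParallelismSketch {
    // Hypothetical per-region work; real code would read the region directory.
    static String loadRegionDir(String table, String region) {
        return table + "/" + region;
    }

    // One task per table: all regions of a huge table are handled by one thread.
    static List<Callable<List<String>>> tableLevelTasks(Map<String, List<String>> tables) {
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tables.entrySet()) {
            final String table = e.getKey();
            final List<String> regions = e.getValue();
            tasks.add(() -> {
                List<String> loaded = new ArrayList<>();
                for (String region : regions) loaded.add(loadRegionDir(table, region));
                return loaded;
            });
        }
        return tasks;
    }

    // One task per region: a huge table is spread across the whole pool.
    static List<Callable<List<String>>> regionLevelTasks(Map<String, List<String>> tables) {
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tables.entrySet()) {
            final String table = e.getKey();
            for (String region : e.getValue()) {
                tasks.add(() -> Collections.singletonList(loadRegionDir(table, region)));
            }
        }
        return tasks;
    }

    // Runs all tasks on a fixed pool and collects results in submission order.
    static List<String> runAll(List<Callable<List<String>>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : pool.invokeAll(tasks)) {
                out.addAll(f.get());
            }
            return out;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(ex);
        } catch (ExecutionException ex) {
            throw new RuntimeException(ex.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Made-up layout: one large table and one small one.
        Map<String, List<String>> tables = new LinkedHashMap<>();
        tables.put("big", java.util.Arrays.asList("r1", "r2", "r3", "r4"));
        tables.put("small", Collections.singletonList("r1"));
        System.out.println("table-level tasks: " + tableLevelTasks(tables).size());
        System.out.println("region-level tasks: " + regionLevelTasks(tables).size());
        System.out.println(runAll(regionLevelTasks(tables), 4));
    }
}
```

Both task lists produce the same results; the difference is only in how the
work is divided. With table-level tasks, the number of threads that can do
useful work is capped by the number of tables, which is the bottleneck
described above for a cluster dominated by one table.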