[
https://issues.apache.org/jira/browse/HBASE-15940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315073#comment-15315073
]
Stephen Yuan Jiang edited comment on HBASE-15940 at 6/4/16 12:02 AM:
---------------------------------------------------------------------
The existing flow is: offline the region, move some of the files in the overlapped
regions to a new region, then sideline the old region directory. So any files
(e.g. the old .regioninfo file) that are not moved will be sidelined. I don't
have to write any new code for this: the unmoved reference files are
automatically sidelined at the end of the merge.
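A minimal sketch of the merge step described above, assuming an in-memory list of store file names instead of real HDFS paths. `MergePlanSketch`, `MergePlan`, and `plan` are hypothetical names, not HBCK code, and the reference-file check is a simplification that assumes reference files are named `<hfileName>.<parentEncodedRegionName>` while plain HFiles have a bare hex name. The sketch only shows the split between files moved into the new region and files left behind to be sidelined with the old directory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: store files of an overlapped region are either moved
 * into the new merged region or left in place, where they are sidelined
 * together with the old region directory.
 */
class MergePlanSketch {

    // Simplifying assumption: a reference file is "<hfileName>.<parentEncodedRegionName>",
    // a plain HFile is a bare hex name. HBase's real naming rules are more involved.
    private static final Pattern REFERENCE_NAME =
            Pattern.compile("^[0-9a-f]+\\.[0-9a-f]+$");

    static boolean isReferenceFile(String fileName) {
        return REFERENCE_NAME.matcher(fileName).matches();
    }

    /** Files to move into the new region vs. files left behind to be sidelined. */
    static final class MergePlan {
        final List<String> moved = new ArrayList<>();
        final List<String> sidelined = new ArrayList<>();
    }

    static MergePlan plan(List<String> storeFiles) {
        MergePlan plan = new MergePlan();
        for (String file : storeFiles) {
            if (isReferenceFile(file)) {
                // Not moved: it stays in the old directory and is sidelined with it.
                plan.sidelined.add(file);
            } else {
                plan.moved.add(file);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        MergePlan plan = plan(List.of("3f2a9c", "7be41d.a1b2c3d4"));
        // prints moved=[3f2a9c] sidelined=[7be41d.a1b2c3d4]
        System.out.println("moved=" + plan.moved + " sidelined=" + plan.sidelined);
    }
}
```

Leaving reference files where they are means the daughter that owns them keeps referencing the parent, so no extra cleanup code is needed beyond the existing sideline step.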
Yeah, I did see HBASE-15406 when I added the new admin.setCatalogJanitor(false)
code. If the HBCK run is killed by Ctrl-C, the Catalog Janitor stays disabled
until someone manually re-enables it. I think Ctrl-C is not a common practice,
and not running the Catalog Janitor is not a big deal (at least to me: how often
do we split/merge regions and then run a major compaction?). I feel that using
ZooKeeper to track the Catalog Janitor enable/disable state is overkill for this
problem. That is why I prefer a simple disable, with the re-enable in the
finally block.
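A minimal sketch of the disable/enable-in-finally pattern, assuming a hypothetical `CatalogJanitorSwitch` interface as a stand-in for the HBase Admin call (it is not the real HBase API). One design choice differs slightly from a plain enable: the sketch restores the janitor's previous state, so an operator who had deliberately disabled it does not find it switched back on.

```java
/**
 * Hypothetical stand-in for the HBase Admin call that toggles the Catalog
 * Janitor; it is not the real HBase API.
 */
interface CatalogJanitorSwitch {
    /** Sets the janitor state and returns the previous state. */
    boolean setEnabled(boolean enabled);
}

/** In-memory switch used only to demonstrate the pattern. */
class InMemoryJanitorSwitch implements CatalogJanitorSwitch {
    boolean enabled = true;

    @Override
    public boolean setEnabled(boolean enabled) {
        boolean previous = this.enabled;
        this.enabled = enabled;
        return previous;
    }
}

class JanitorGuard {
    /** Runs the repair with the janitor disabled, restoring its prior state afterwards. */
    static void runWithJanitorDisabled(CatalogJanitorSwitch janitor, Runnable repair) {
        boolean wasEnabled = janitor.setEnabled(false); // no parent cleanup while files move
        try {
            repair.run();
        } finally {
            // Runs on normal return or on an exception. It is not guaranteed to run
            // if the JVM is terminated by a signal such as Ctrl-C, which is why the
            // janitor can stay disabled in that case.
            janitor.setEnabled(wasEnabled);
        }
    }

    public static void main(String[] args) {
        InMemoryJanitorSwitch janitor = new InMemoryJanitorSwitch();
        runWithJanitorDisabled(janitor,
                () -> System.out.println("during repair: enabled=" + janitor.enabled));
        System.out.println("after repair: enabled=" + janitor.enabled);
    }
}
```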
was (Author: syuanjiang):
The existing flow is: offline the region, move some of the files in the overlapped
regions to a new region, then sideline the old region directory. So any files
(e.g. the old .regioninfo file) that are not moved will be sidelined. I don't
have to write any new code for this.
Yeah, I did see HBASE-15406 when I added the new admin.setCatalogJanitor(false)
code. If the HBCK run is killed by Ctrl-C, the Catalog Janitor stays disabled
until someone manually re-enables it. I think Ctrl-C is not a common practice,
and not running the Catalog Janitor is not a big deal (at least to me: how often
do we split/merge regions and then run a major compaction?). I feel that using
ZooKeeper to track the Catalog Janitor enable/disable state is overkill for this
problem. That is why I prefer a simple disable, with the re-enable in the
finally block.
> HBCK unnecessary moves reference files when a table has split region to fix
> non-existing overlap regions
> --------------------------------------------------------------------------------------------------------
>
> Key: HBASE-15940
> URL: https://issues.apache.org/jira/browse/HBASE-15940
> Project: HBase
> Issue Type: Bug
> Components: hbck
> Affects Versions: 1.0.0
> Reporter: Stephen Yuan Jiang
> Assignee: Stephen Yuan Jiang
> Attachments: org.apache.hadoop.hbase.util.TestHBaseFsck-output.txt,
> repro-hbck-repair-healthy-splitted=region.patch, skipReferenceFiles.patch
>
>
> When the repair option (specifically -fixHdfsOverlaps) is run against a
> table that has split regions (both the parent region and the child regions
> exist, with reference files), HBCK wrongly concludes that overlapping regions
> exist and tries to merge them to fix the overlap.
> This is by design: the current HBCK implementation treats HDFS as the trusted
> source and does not consult the META table.
> Here is the comment from one of the unit tests:
> {code}
> // TODO: fixHdfsHoles does not work against splits, since the parent dir lingers on
> // for some time until children references are deleted. HBCK erroneously sees this as
> // overlapping regions
> {code}
> However, this is undesirable. When the reference files are moved to a new
> region, the parent region has no daughter regions left, so CatalogJanitor can
> clean it up. This creates a real inconsistency: lingering reference files
> whose parent region no longer exists.
> Another bad consequence is that split regions get merged back into one.
> Although this is undesirable, at least it does not cause further inconsistency.
> This JIRA does not try to solve the unsplit issue, as that requires a bigger
> design change in HBCK.
> This JIRA addresses the potential lingering-reference-files issue, which
> multiple customers running branch-1 have hit in production. (The workaround
> is to run a major compaction on all split regions before running HBCK; this
> can take a long time and affect production.)
> Attached are the log and a modified unit test that reproduce the issue.
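A minimal sketch of why an HDFS-only view reports a false overlap: after a split, the parent directory lingers next to its daughters, and their key ranges overlap. `OverlapSketch`, `RegionDir`, and `findOverlaps` are hypothetical names, not HBCK code. As a simplification, keys are Java strings compared lexicographically rather than HBase's byte arrays, and an empty end key stands for the end of the table.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an overlap check that trusts only region directories on HDFS. */
class OverlapSketch {

    /** A region directory with key range [startKey, endKey). */
    static final class RegionDir {
        final String name;
        final String startKey;
        final String endKey;

        RegionDir(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // key < end, where an empty end key stands for "end of table".
    private static boolean before(String key, String end) {
        return end.isEmpty() || key.compareTo(end) < 0;
    }

    // Two half-open ranges overlap when each one starts before the other ends.
    static boolean overlaps(RegionDir a, RegionDir b) {
        return before(a.startKey, b.endKey) && before(b.startKey, a.endKey);
    }

    /** Pairwise check over every region directory found, without consulting META. */
    static List<String> findOverlaps(List<RegionDir> dirs) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < dirs.size(); i++) {
            for (int j = i + 1; j < dirs.size(); j++) {
                if (overlaps(dirs.get(i), dirs.get(j))) {
                    pairs.add(dirs.get(i).name + "<->" + dirs.get(j).name);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<RegionDir> dirs = List.of(
                new RegionDir("parent", "a", "z"),    // split parent, still on HDFS
                new RegionDir("daughterA", "a", "m"), // holds reference files to parent
                new RegionDir("daughterB", "m", "z"));
        System.out.println(findOverlaps(dirs)); // prints [parent<->daughterA, parent<->daughterB]
    }
}
```

The daughters do not overlap each other; only the lingering parent does. Telling the parent apart as a split region would mean consulting META, which is the bigger design change the description leaves out of scope.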
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)